THE PSYCHOMETRIC SOCIETY: ROOTS AND POWERS*


JACK W. DUNLAP 
UNIVERSITY OF ROCHESTER 


A classical function familiar to all psychologists is the expression 
S→R, commonly called the S-R bond, or stimulus-response. This
function might well be the motto of this Society since one of the pri- 
mary concerns of the Society’s members is that of prediction and con- 
trol; that is, given a particular stimulus or set of stimuli, what is the 
most likely expectation as to the response or result. These symbols
are particularly significant when one examines them in connection
with the Society, for they lead one to a consideration of the causes that
led to its organization and to its probable effect on psychology in the
future.

The Gestaltist might interpret the formation of the Society as an 
illustration of the principle of closure, but, while this is a perfectly 
reasonable rationalization of the fait accompli, it seems to have a tone 
of finality and completeness which is not at all in keeping with the 
possibilities of the future for the Psychometric Society. For this rea- 
son, I have chosen the Thorndikian rather than the Gestalt principle 
to exemplify our Society. 

Psychology as a formal study is young as disciplines go, and, as
a science, is barely through its birth pains. In considering this science
one cannot but remember James’ famous remark about the new-born
child being in a state of “blooming, buzzing confusion” and realize
that psychology is just beginning to attain some order in its domain.
This state of confusion is not peculiar to psychology or to psycholo- 
gists but is natural in any new field of human endeavor. The original 
approaches in any science are based on common experiences, philo- 
sophical speculations, and inherited superstitions, and only with exas- 
perating slowness can a body of quantitative data based on controlled 
observations be secured from which it is possible to develop rational 
hypotheses. 

The task confronting the early workers in the field of psychology 
was tremendous, and there is no obvious and easy way for us to evalu- 
ate the contributions to quantitative rational psychology of such men 


* Presidential Address, Psychometric Society, September 4, 1941, Northwest- 
ern University. 


2 PSYCHOMETRIKA 


as Bacon and Galton in experimental design, of Weber and Fechner 
in psychophysics, of Lashley and Rashevsky in physiological psychol- 
ogy, of Binet, Terman, and Thorndike in mental measurements, of 
Spearman and Thomson in quantifying theories of intelligence, of 
Allport in social psychology, of Warden in formulating the dynamics 
of animal behavior, and of Fisher, Kelley, Pearson, and Thurstone in 
the analysis of data, to mention only a few. The efforts of these men 
are not only widely spread over the field of psychology, but they have 
been even more widely scattered in terms of time and geography. This 
long and widespread attempt at quantification is indicative of the need 
for such work. The importance of such attempts is further evidenced 
by the steadily increasing volume of technical literature appearing in 
various periodicals during the first four decades of the current cen- 
tury. 
There can be no question that the stimuli were present, and only 
a catalyst was needed in the form of some individual to implement the 
uniting of widely separated scholars into an articulate and functional 
organization. In 1931 such an individual appeared: Dr. A. P. Horst. 
Horst had a firm conviction that there was a strong and growing in- 
terest in the quantification of our science and that what was needed 
was a medium of publication devoted to this purpose. He believed that 
the quantity and quality of the articles then appearing in widely scat- 
tered sources furnished a sound basis for establishing a journal de- 
voted to the development of psychology on a quantitative rational 
basis. 

In 1931 Horst was attempting to develop or locate a journal that 
would be devoted to quantitative methods as applied to education and 
psychology. The journals that most nearly met this condition were 
the Journal of Educational Psychology and the Journal of the Ameri- 
can Statistical Association. Both of these, however, had other and 
more general purposes to serve than that proposed by Horst. During 
the following years Horst discussed the matter at great length with 
A. K. Kurtz and in 1933 carefully examined the possibilities of such 
a journal with L. L. Thurstone and M. W. Richardson. The idea of 
such a journal appealed strongly to Thurstone, since he was just be- 
ginning to publish his results on factor analysis. Richardson’s inter- 
est in the theoretical problems of test construction guaranteed his sup- 
port of the projected publication. During the latter part of 1933 the 
matter was brought to the attention of the speaker, because of his con- 
nection with the Journal of Educational Psychology and his conse- 
quent knowledge of the quantity of technical material available for 
the support of such a journal. In the spring of 1934 Thurstone went 
over the details of establishing such a journal, and at this time various
methods of financing the journal were considered. Several attempts
were made to interest one of the Foundations in supporting
the proposed periodical, but all to no avail. Throughout the spring 
and summer of 1934 Horst, Kurtz, Richardson, and Stalnaker were 
working on details as to costs, publishers, policies, and the methodol- 
ogy of editorial management. 

Thus, at the time of the fall meeting of the American Psychologi- 
cal Association at Columbia University, “Psychometrika” was only a 
nebula in a mist of words and wishes. A series of conferences of those 
interested in the project during the week of the meetings crystallized 
the plans and brought the group together as a unit, with the result that
Psychometrika began to assume form and substance. As a result of
these conferences the material gathered by different individuals as to
publishers, costs, sales, style, policy, and editorial management was
collated, and specific tasks were assigned to particular individuals. It
was at this time that Kurtz began to emphasize the fact that, if there 
were readers for such a journal, they would be interested, in all like- 
lihood, in forming a society in which their common interest would be 
the keynote. 

The formation of such a society would have many advantages:
identifying individuals with common interests, focusing attention
on the need and importance of developing a quantitative rational
psychology, providing a physical meeting where technical papers
could be read (and, perhaps, appreciated), locating possible contributors
to the journal, and, last but not least, if the journal was
to be the official organ of the society, providing financial support for
its publication. The only fly in the ointment was that there was no
idea of how many individuals were interested in such an organization. 
There seemed to be a number of cogent arguments to the effect that 
such an organization would have a greater chance of success if the 
journal, Psychometrika, were to appear, like Minerva from Jove’s 
forehead, full blown before the public immediately after the organi- 
zation of the society. But here a paradoxical situation arose—to have 
the journal it was first necessary to have the society, but to have the 
society it was claimed that one must first have the journal. So the 
matter stood through the fall of 1934 and the spring of 1935. 

The next problem was to determine whether other biometricians, 
educators, psychologists, and statisticians were interested in forming 
such a Society. Thurstone made this possible by the liberal contribu- 
tion of not only his own time and effort but also that of his staff. 
Through the facilities at his command, letters of inquiry were sent 
to a large number of individuals who, it was thought, might be in- 
terested. As a result of this canvass, invitations were extended to all 
who replied to attend the formation of the Society on September 4,
1935, at Ann Arbor, Michigan, during the session of the American 
Psychological Association. Temporary officers were appointed for the 
Society, and later in the fall a mail ballot for election of officers was 
held. Dr. L. L. Thurstone was the first president, Dr. Paul Horst, the 
secretary, and the speaker, the treasurer. 

During the following year the constitution committee, composed 
of Horst, Kurtz, and Richardson, prepared the present constitution of 
the Society, which was officially adopted at the second annual meeting 
held at Dartmouth. This was amended in 1937 to include a “student 
membership,” and at present such memberships constitute approxi- 
mately one-sixth of the total membership. 

The growth of the organization has been slow, but, on the other 
hand, the membership has had relatively few withdrawals. The paid 
membership for 1936 included 133 individuals; in the succeeding
years it rose to 185, 189, and 214, and in 1940 it dropped to 200 paid
members. That the membership takes an interest in the affairs of the
Society is indicated by the fact that approximately forty per cent of the
eligible members voted in the recent elections. 

In dealing with the historical development of the Society it is im- 
possible to disentangle its history from that of the Psychometric Cor- 
poration. As pointed out above, one fundamental question was how 
to publish the Journal immediately upon the organization of the So- 
ciety. At the Ann Arbor meeting it was voted to have dues of one dol- 
lar a year until the Journal appeared, and thereafter of five dollars 
a year. Thus, there was still no capital for starting the journal. 

Suddenly, shortly after the beginning of 1936, Horst became impatient,
and with a confidence equalling his foresight, he cut the Gordian
knot by offering to underwrite the losses of the journal for the
first year up to one-fourth of its cost. It was a simple but practical
solution, and an example which was immediately followed to a lesser
extent by Kurtz, Thurstone, Richardson, and, so as not to appear too
niggardly, by Dunlap. Somehow the word got about as to the plans
for the journal and how it was to be financed initially. It was only a 
short time until pledges of support had been received from Guilford, 
Gulliksen, Kuder, Lorge, Stalnaker, and Thorndike. Suddenly, it was 
realized that sufficient money had been guaranteed to publish the jour- 
nal for at least a year and a half, but offers still came in to help un- 
derwrite the venture. This spontaneous reaction seemed to more than 
justify the attempt to organize the Psychometric Society and to pro- 
ceed with the publication of Psychometrika. 

If the journal were to appear immediately, it was necessary that 
some legal unit be responsible for the financial arrangements. This 
was the basic reason for the formation of the Psychometric Corporation,
independent of the Psychometric Society. Another cogent reason
was that the original sponsors felt that during the early years
the policy of the journal should conform closely with the basic ideas 
of its founders and not degenerate into another periodical devoted to 
publishing only the results of psychological and educational measure- 
ments. It was believed that if an editorial policy was firmly estab- 
lished, the journal would then go on regardless of changes in the com- 
position of the editorial board. If, as time went on, the journal was 
a success, the Corporation would gradually turn over the control of 
Psychometrika to the members of the Society. It was for these rea- 
sons that on August 24, 1936, the Psychometric Corporation was in- 
corporated in the State of Illinois. 

That there was a real need for such a journal is shown by the
list of library subscriptions, which has grown to include 78 libraries 
at present, exclusive of those in foreign countries that have dropped 
their subscriptions for the duration of the war. Within the short 
space of two years after its first appearance Psychometrika could be 
found in libraries in Canada, China, England, Austria, France, Ger- 
many, Scotland, South Africa, and a number of countries which now 
are only memories. This growth occurred in spite of the fact that the 
cost to a library was twice that for a private subscription. An un- 
usual innovation was sending to libraries a fresh volume of the jour- 
nal at the end of each year for binding purposes. 

Originally the journal assessed contributors a dollar a page for 
text and tables and for the cost of cuts, but at the Columbus meeting 
in 1938, this charge was reduced to fifty cents a page with no charge 
for cuts. However, at the annual meeting held at Penn State College 
in 1940 this charge also was completely eliminated. The contributor 
has always been furnished gratis with two hundred copies of his ar- 
ticle. 

Another interesting fact about the journal is that each manu- 
script is first examined by the Managing Editor who removes all ref- 
erences to the name of the author. The manuscript is then evaluated 
by three members of the Editorial Board. The Managing Editor col- 
lates the comments on the manuscript and then accepts the article; 
rejects the article; or accepts it, conditional upon revisions suggested 
by the readers. This practice has contributed in no small way to the 
quality and uniform grade of material appearing in the Journal. The 
untiring efforts of the Editor, Dr. M. W. Richardson, and of Dr. Dorothy
Adkins, Assistant Editor, have been no small factor in the production
of the Journal.

Publication of a journal is not all that the Corporation has done. 
In 1938 there appeared the first of a series of Psychometric Monographs.
This first monograph, entitled “Primary Mental Abilities,”
illustrates with emphasis the type of quantitative rational psychology 
that the Society has tried to develop. In 1939 the Psychometric Cor- 
poration made possible the publication of another journal, the Bulletin 
of Mathematical Biophysics. The Corporation acted as publishing and 
financial agent for the Bulletin and in 1940 turned this publication 
over to the University of Chicago Press. 

Today we are a society on a firm professional and financial basis, 
and we have a journal that has attained a high rank among scientific 
periodicals. It is with no little pleasure that, as a former treasurer of 
both the Society and the Corporation, I announce that the loans of
the original sponsors have been repaid and that both the Society and
Corporation have no debts but rather a small but comfortable reserve 
with which to face the contingencies of the future. 

But enough of the past. Let us consider the present and the fu- 
ture of our Society. At each annual meeting the Society has spon- 
sored a program of papers. Last year barely enough papers were sub- 
mitted to form a program, and this year the number of papers sub- 
mitted was so small that it was impossible to arrange a program. What 
does this mean? Is it that our members are not engaged in productive 
scientific work? Is interest in the development of the principles of 
the organization waning? Is the type of program not satisfactory? 
Are our members so busy with affairs of national defense that they 
cannot participate in scientific programs? 

I do not know, but I suspect that not one but all of these reasons 
have contributed to a greater or lesser extent to this unfortunate 
state of affairs. This is an important juncture in the development 
of the Psychometric Society. Shall we abandon our program of pap- 
ers? Shall we have a program consisting of two or three invited pap- 
ers? Or shall we attempt to design an entirely new type of program? 
This is not a task for the President or for that matter, for your offic- 
ers, but this is a task for the total membership. Surely the time has 
not arrived when we can comfortably recline on our advances and 
say, “Psychology is now a quantitative rational science; let us main- 
tain the status quo.” That there is work to be done, almost an un- 
limited amount, seems to be apparent. What must be done is to in- 
duce each member to take a more active part in the affairs of the So- 
ciety and to contribute of his time, energy, and ideas. Your officers 
will welcome any suggestions and be glad to receive your assistance 
in implementing these suggestions into concrete action. 

Perhaps part of the fault lies with our Journal, which has a pre- 
ponderance of material so highly technical that only the specialist can 
read and understand it. But we, as members, must remember that
the editors cannot publish material that is not submitted to them. Per- 
sonally, I would like to see more material such as the highly practical 
article of Richardson and Kuder on the “Theory of Estimating Test 
Reliability.” If the members want articles other than those on factor 
analysis or the determination of the order of a matrix, they must 
submit technical manuscripts on other topics. 

The future should see a great many papers of theoretical and 
practical importance in all fields of psychology. That problems abound 
that demand formulation in quantitative and mathematical terms you 
know even better than I. The entire field of quotients needs to be ex- 
amined and stated in more precise and rigorous terms. We have al- 
lowed ourselves to be shackled by the I.Q. with the resulting contro- 
versies about the stability of this function. Millions of words have 
been written on the subject, and thousands of hours have been ex- 
pended in computing and recomputing this function. Curiously, little 
has been done on the basic rationale underlying the function. Thur- 
stone and Thorndike have attempted to develop more satisfactory 
units of measurement, and Heinis has developed a function which has 
received far too little consideration in the discussions of this prob- 
lem. Here indeed, is a field worthy of investigation and restatement. 

The field of item analysis and test construction is just beginning 
to emerge from the labored gropings as to methodology. Recently it 
was my good fortune to examine a manuscript by Horst et al., in which
a systematic attempt had been made to develop item analysis on a 
rational basis. This work, however, was far from conclusive, and I 
am sure the authors would be the first to disclaim that all the prob- 
lems were solved. 

The general field of prediction, which is being emphasized so 
strongly in this time of national emergency, when it is vital to our 
country and to each of us that each man be placed where he will be 
most effective, is filled with problems demanding our attention. What 
are criteria? How can their validity be established? What is the most 
predictable criterion? What is the minimum number of variables 
from a given matrix which will give valid, reliable, and effective es- 
timation of either a single or a multiple criterion? 

The rationale of rating scales has advanced little since the last
world war, and there is no question but that the pressure of time and 
numbers will again bring such scales to the fore. Here, indeed, is a 
field that should challenge the membership, not only for its theoreti- 
cal implications but also for its practical applications. 

The current personality scales represent a cut-and-try methodol- 
ogy, and only recently has there been any attempt to apply modern 
methods of analysis to such scales. It is necessary in this field not
only to apply more rigorous statistical techniques but also to attempt 
to delimit the problems more precisely and to formulate hypotheses 
susceptible to experimental study. 

Whichever way one turns he is confronted with problems in so- 
cial psychology, animal psychology, and in the psychology of person- 
ality, to mention a few, where solutions await the development of a 
rationale that can be subjected to quantitative formulation. I could 
go on with these citations, but why should I mention other fields when 
you are even more familiar with their problems than I? 

The Psychometric Society emerged as a result of a felt need and 
so far has served its purpose admirably. That its services to psychol- 
ogy will be more substantial in the future is not merely a possibility, 
but is a probability of a very high order. The roots of the Society are 
firmly fixed, and its powers, though latent, are just beginning to 
emerge and will, I am confident, be a major force in the psychology 
of tomorrow. 
ON THE NUMBER OF FACTORS


QUINN MCNEMAR 
STANFORD UNIVERSITY 


A proposed criterion for the number of factors is developed on 
the basis of the similarity between a factorial residual and the par- 
tial correlation coefficient; something is known concerning the sam- 
pling error of the latter. Instead of computing the residuals as par- 
tials, a formula is presented for adjusting the standard deviation 
of the distribution of residuals so as to approximate the S.D. of the 
residuals as partial correlations. The criterion requires that factors
be extracted until the adjusted S.D. reaches or falls below $1/\sqrt{N}$.
When tried out on six samples drawn from six universes of known 
factorial description, the criterion indicated the correct number of 
factors each time. The requisites of situations adequate for such 
empirical checks are discussed. 


It is appropriate to begin this paper with a few words regarding 
a priori requisites for an adequate criterion for the number of fac- 
tors. It would not seem unreasonable to require that any proposed 
criterion should exhibit some degree of rationality before being tried 
out. Some of the proposals are rather obviously nonsensical, and 
therefore unworthy of consideration. Let us list a few of these and 
point out, rather categorically, some objections thereto. 

First: the frequency distribution of centroid loadings should be- 
come uni-modal. This can never be adequate because the centroid 
method precludes uni-modality. 

Second: the range of the centroid loadings should be less than 
some arbitrary value, e.g., .30. This gets us nowhere since we would 
then need a criterion for setting up the arbitrary value. 

Third: the curve of some defined function should flatten marked- 
ly. To be useful, such a criterion would need to be fortified with a 
criterion for deciding when a curve has flattened markedly. 

Let us now turn to a few positive suggestions. It seems logical 
to suppose that an adequate criterion will be some function of the 
size of the sample. It would be strange indeed to find a formula con- 
cerned with sampling errors which is not a function of capital N. By 
analogy with the standard error of a distribution of tetrads one 
might expect an adequate criterion to be a function of the number of 
variables. On the practical side, a satisfactory criterion must not 
involve an unreasonable amount of computation. One analytically de- 
rived criterion calls for the value of the determinant of the original 
correlation matrix. We can foresee some difficulty in computing this
for a determinant of order 57. 

The first criterion ever proposed was that the analysis should be 
continued until the standard deviation of the residuals becomes less 
than the sampling error of the mean original correlation coefficient. 
This criterion has not proved valid in experimental studies, and it 
does not always work when checked empirically; but because of its 
credibility, we have re-examined it in order to see whether some
modification might make it acceptable. Since the residuals are highly
analogous to partial r’s, one wonders why it should have been assumed
that their sampling errors should be a function of the magnitude of
the original correlations. This isn’t the case for partial r’s, so we
suggest that it would be more logical to require the residual standard
deviation to fall below the standard error of an r of zero rather than 
that for the mean original r. This would lead to the extraction of fewer 
factors, and thus would tend to exaggerate the known bias in the re- 
sidual criterion. 

But let us push the analogy with the partial correlation technique
a bit farther. In computing, say, the first factor residuals, the
numerator term of the formula for partial r is being used, whereas the
denominator terms are ignored. If these latter terms were used, the
residuals would be larger, and they would correspond to partial r’s.
Now the standard error of a partial r is known, hence the significance
of the deviation of such residuals from zero could readily be
determined. Instead of disrupting the centroid process by computing
residuals as partials, it is possible to adjust the standard deviation of
the ordinarily obtained residuals so as to approximate closely the
standard deviation of the distribution of the corresponding partials.
Then the analysis would be carried to the point at which such an
adjusted standard deviation is equal to or less than the standard error
of a zero correlation, i.e., less than $1/\sqrt{N}$.

Derivation 
Let $r_{ab}$ be the correlation between tests a and b, and let
$a_1, a_2, a_3, \ldots, b_1, b_2, b_3, \ldots$ stand for centroid factor
loadings. Suppose one factor has been extracted; the residual is

$$\rho_{ab} = r_{ab} - a_1 b_1 . \tag{1}$$

If, however, this residual were computed as a partial correlation we
would have

$$r_{ab \cdot 1} = \frac{r_{ab} - a_1 b_1}{\sqrt{1 - a_1^2}\,\sqrt{1 - b_1^2}} . \tag{2}$$








Now if a second factor has been extracted, the residual expressed as
a partial correlation would be of the form

$$r_{ab \cdot 12} = \frac{r_{ab \cdot 1} - r_{a2 \cdot 1}\, r_{b2 \cdot 1}}{\sqrt{1 - r_{a2 \cdot 1}^2}\,\sqrt{1 - r_{b2 \cdot 1}^2}} . \tag{3}$$








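But

$$r_{a2 \cdot 1} = \frac{r_{a2} - r_{a1}\, r_{12}}{\sqrt{1 - r_{a1}^2}\,\sqrt{1 - r_{12}^2}} .$$

Since $r_{12} = 0$, $r_{a2} = a_2$, and $r_{a1} = a_1$, we have

$$r_{a2 \cdot 1} = \frac{a_2}{\sqrt{1 - a_1^2}}$$

and

$$r_{b2 \cdot 1} = \frac{b_2}{\sqrt{1 - b_1^2}} .$$
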
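Then (3) becomes

$$r_{ab \cdot 12} = \frac{\dfrac{r_{ab} - a_1 b_1}{\sqrt{1 - a_1^2}\,\sqrt{1 - b_1^2}} - \dfrac{a_2 b_2}{\sqrt{1 - a_1^2}\,\sqrt{1 - b_1^2}}}{\sqrt{1 - \dfrac{a_2^2}{1 - a_1^2}}\;\sqrt{1 - \dfrac{b_2^2}{1 - b_1^2}}} ,$$

which simplifies to

$$r_{ab \cdot 12} = \frac{r_{ab} - a_1 b_1 - a_2 b_2}{\sqrt{1 - a_1^2 - a_2^2}\,\sqrt{1 - b_1^2 - b_2^2}} . \tag{4}$$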








By a similar use of partials of lower order, it can be shown that the
third factor residuals as partials take on the form

$$r_{ab \cdot 123} = \frac{r_{ab} - a_1 b_1 - a_2 b_2 - a_3 b_3}{\sqrt{1 - a_1^2 - a_2^2 - a_3^2}\,\sqrt{1 - b_1^2 - b_2^2 - b_3^2}} .$$





Presumably, with laborious algebra, this could be extended. But 
the use of determinants greatly facilitates the resolution for the gen- 
eral case of s factors. Let the major determinant of the system of 
any two tests and s factors be designated as 
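
$$D = \begin{vmatrix}
1 & r_{ab} & a_1 & a_2 & a_3 & \cdots & a_s \\
r_{ab} & 1 & b_1 & b_2 & b_3 & \cdots & b_s \\
a_1 & b_1 & 1 & 0 & 0 & \cdots & 0 \\
a_2 & b_2 & 0 & 1 & 0 & \cdots & 0 \\
a_3 & b_3 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
a_s & b_s & 0 & 0 & 0 & \cdots & 1
\end{vmatrix} .$$

Then, by formula 266 of Kelley (4, p. 298)

$$r_{ab \cdot 123 \cdots s} = \frac{D_{ab}}{\sqrt{D_{aa}}\,\sqrt{D_{bb}}} , \tag{5}$$

where the subscripts indicate which row and column have been deleted
in defining the minors. Straightforward evaluation of these
three minors leads to

$$r_{ab \cdot 123 \cdots s} = \frac{r_{ab} - a_1 b_1 - a_2 b_2 - a_3 b_3 - \cdots - a_s b_s}{\sqrt{1 - a_1^2 - a_2^2 - a_3^2 - \cdots - a_s^2}\,\sqrt{1 - b_1^2 - b_2^2 - b_3^2 - \cdots - b_s^2}} . \tag{6}$$
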
It will be noted that the numerator is the ordinary residual and
that the denominator is the product of the uniquenesses of the two
tests. Thus,

$$r_{ab \cdot 123 \cdots s} = \frac{\rho_{ab}}{u_a u_b}$$

or

$$r_{ij \cdot 123 \cdots s} = \frac{\rho_{ij}}{u_i u_j} , \qquad i \neq j . \tag{7}$$

In order to avoid the necessity of actually computing the n(n-1)/2
residuals as partials, we next seek an expression for the S.D. of the
distribution of these partial residuals as a function of the ordinary
residuals and the tests’ uniquenesses.

Expression (7) defines a variable, the partial residuals for
n(n-1)/2 intercorrelations, in terms of an index having a variable
numerator and a denominator which is the product of two variables.
We desire the S.D. of the distribution of this index as a function of
statistics determinable from the variables which enter into the index.
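
The equivalence stated in (7) can be checked numerically. The following is a minimal sketch, not part of the original derivation: it partials out two uncorrelated factors one at a time with the usual first-order formula and compares the result with the ordinary residual divided by the product of the two uniquenesses. The correlation and the loadings are hypothetical values chosen only for illustration.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy.z from three zero-order r's."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def residual_as_partial(r_ab, a, b):
    """Ordinary residual divided by the product of the two uniquenesses."""
    rho = r_ab - sum(x * y for x, y in zip(a, b))
    u_a = math.sqrt(1 - sum(x * x for x in a))
    u_b = math.sqrt(1 - sum(y * y for y in b))
    return rho / (u_a * u_b)


# Hypothetical correlation and centroid loadings (illustration only).
r_ab = 0.50
a = [0.6, 0.3]
b = [0.5, 0.4]

# Stepwise route: remove factor 1, then factor 2 (factors uncorrelated, r_12 = 0).
r_ab_1 = partial_r(r_ab, a[0], b[0])
r_a2_1 = partial_r(a[1], a[0], 0.0)
r_b2_1 = partial_r(b[1], b[0], 0.0)
stepwise = partial_r(r_ab_1, r_a2_1, r_b2_1)

closed_form = residual_as_partial(r_ab, a, b)
print(stepwise, closed_form)
```

The two printed values should agree to rounding error, which is all the closed form claims for the two-factor case.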

It should be noted that since the S.D. of the residuals is usually
computed from a double-entry distribution, the mean of the distribution
is zero. Similarly the mean of the partial residuals is also zero;
that is, if $X_1 = \rho_{ij}$ and $X_2 = u_i u_j$ and we define the index of (7) as
$I = X_1/X_2$, then

$$M_I = \frac{1}{n(n-1)} \sum \frac{X_1}{X_2} = 0$$

for the double-entry distribution. The distribution variance will
therefore take on the form

$$\sigma_I^2 = \frac{1}{n(n-1)} \sum \left(\frac{X_1}{X_2}\right)^2 .$$
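The summation includes each partial twice. Obviously, if we work
from the distribution of absolute values, with each residual entered
once, we have, letting m = n(n-1)/2, and writing $x_1$ and $x_2$ for
deviations of $X_1$ and $X_2$ from their means (the mean of $X_1$ being
zero and that of $X_2$ being $M_2$), the following

$$\sigma_I^2 = \frac{1}{m} \sum \left(\frac{X_1}{X_2}\right)^2 = \frac{1}{m} \sum \left(\frac{x_1}{M_2 + x_2}\right)^2 = \frac{1}{m M_2^2} \sum x_1^2 \left(1 + \frac{x_2}{M_2}\right)^{-2} .$$

But

$$\left(1 + \frac{x_2}{M_2}\right)^{-2} = 1 - \frac{2 x_2}{M_2} + \frac{3 x_2^2}{M_2^2} - \cdots ,$$

whence

$$\sigma_I^2 = \frac{1}{m M_2^2} \left[ \sum x_1^2 - \frac{2 \sum x_1^2 x_2}{M_2} + \frac{3 \sum x_1^2 x_2^2}{M_2^2} - \cdots \right] . \tag{8}$$
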


The 2nd, 3rd, and remaining terms in this expression involve
correlations between functions of $X_1$ and $X_2$. It seems reasonable to
assume that these tend to zero, hence we have

$$\sigma_I^2 = \frac{\sigma_1^2}{M_2^2} , \quad \text{or} \quad \sigma_I = \frac{\sigma_1}{M_2} . \tag{9}$$

The denominator term of (9) involves the mean of $X_2$, i.e., the
mean of the product of the variables $u_i$ and $u_j$. Let $X_3 = u_i$, $X_4 = u_j$,
then

$$M_2 = \frac{\sum X_2}{m} = \frac{\sum X_3 X_4}{m} = \frac{\sum (M_3 + x_3)(M_4 + x_4)}{m} = \frac{1}{m} \left[ \sum M_3 M_4 + M_3 \sum x_4 + M_4 \sum x_3 + \sum x_3 x_4 \right] ; \tag{10}$$

$$M_2 = M_3 M_4 + r_{34}\, \sigma_3 \sigma_4 .$$

But in this case $M_3 = M_4 = M_u$, and $r_{34}$, the correlation between the
m pairs of uniquenesses, should vanish (empirical checks tend to
verify this assumption). Hence

$$M_2 = M_u^2 , \tag{11}$$

where $M_u$ is the mean of the uniquenesses for the n tests.


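Since the uniqueness for test i is, by definition, $u_i = \sqrt{1 - h_i^2}$, we
next seek to evaluate $M_u$ in terms of h. Thus

$$M_u = \frac{\sum u_i}{n} = \frac{\sum \sqrt{1 - h_i^2}}{n} = \frac{\sum (1 - h_i^2)^{1/2}}{n} .$$

Expanding $(1 - h_i^2)^{1/2}$, we have

$$(1 - h_i^2)^{1/2} = 1 - \tfrac{1}{2} h_i^2 - \tfrac{1}{8} h_i^4 - \tfrac{1}{16} h_i^6 - \cdots .$$

Dropping the 6th and higher order terms,

$$M_u = \frac{\sum \left(1 - \tfrac{1}{2} h_i^2 - \tfrac{1}{8} h_i^4\right)}{n} ;$$

$$M_u = 1 - \tfrac{1}{2} M_{h^2} - \tfrac{1}{8} M_{h^4} . \tag{12}$$

But formula (9) calls for $M_2$, which by (11) is equivalent to $M_u^2$.
From (12)

$$M_u^2 = 1 + \tfrac{1}{4} M_{h^2}^2 + \tfrac{1}{64} M_{h^4}^2 - M_{h^2} - \tfrac{1}{4} M_{h^4} + \tfrac{1}{8} M_{h^2} M_{h^4} , \tag{13}$$

$$M_u^2 = 1 - M_{h^2} \quad \text{(approximately)} .$$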
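
How close the approximation in (13) comes can be gauged on a small example. The sketch below is illustrative only: the six communalities are hypothetical, and the code simply compares the exact squared mean uniqueness with the truncated series of (12) and with the final approximation, one minus the mean communality.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical communalities h^2 for six tests (illustration only).
h2 = [0.20, 0.35, 0.50, 0.30, 0.45, 0.25]

# Exact mean uniqueness, with u_i = sqrt(1 - h_i^2).
exact_Mu = mean([math.sqrt(1.0 - c) for c in h2])

M_h2 = mean(h2)                      # mean communality
M_h4 = mean([c * c for c in h2])     # mean squared communality

series_Mu = 1.0 - M_h2 / 2.0 - M_h4 / 8.0   # expression (12)
approx_Mu2 = 1.0 - M_h2                     # final approximation in (13)

print(exact_Mu ** 2, series_Mu ** 2, approx_Mu2)
```

For communalities of this moderate size the three quantities should differ only in the third decimal place; with very high communalities the dropped terms matter more, so the check is worth repeating on one's own data.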


Thus, we finally have an approximate value for the S.D. of the partial
residuals as

$$\sigma_I = \frac{\sigma_\rho}{M_u^2} = \frac{\sigma_\rho}{1 - M_{h^2}} , \tag{14}$$

where $\sigma_\rho$ is the S.D. of the ordinary residual after s factors have been
extracted, and $M_{h^2}$ is the mean communality for s factors.

We are proposing that when the adjusted S.D., $\sigma_I$, reaches or falls
below $1/\sqrt{N}$, the
magnitudes of the residuals are such that their departure from zero
may be considered as due to chance sampling errors in the original
intercorrelations.
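
As a minimal sketch of how the proposed rule might be applied after each factor is extracted, the code below divides the S.D. of the ordinary residuals by one minus the mean communality and compares the result with $1/\sqrt{N}$. The function names, the sample size, and the three rows of residual S.D.'s and mean communalities are hypothetical; they are not taken from the paper's tables.

```python
import math


def adjusted_sd(sd_residuals, mean_communality):
    """Approximate S.D. of the residuals as partials, following (14)."""
    return sd_residuals / (1.0 - mean_communality)


def enough_factors(sd_residuals, mean_communality, n_cases):
    """True when the adjusted S.D. has reached or fallen below 1/sqrt(N)."""
    return adjusted_sd(sd_residuals, mean_communality) <= 1.0 / math.sqrt(n_cases)


def number_of_factors(steps, n_cases):
    """Return the first factor count whose residuals satisfy the criterion.

    `steps` holds (factors_extracted, sd_of_residuals, mean_communality)
    in order of extraction; None is returned if no step satisfies the rule.
    """
    for n_factors, sd_res, mean_h2 in steps:
        if enough_factors(sd_res, mean_h2, n_cases):
            return n_factors
    return None


# Hypothetical extraction history for N = 400 cases (illustration only).
N = 400
history = [(1, 0.090, 0.30), (2, 0.048, 0.42), (3, 0.026, 0.47)]

for n_factors, sd_res, mean_h2 in history:
    print(n_factors, round(adjusted_sd(sd_res, mean_h2), 4), round(1 / math.sqrt(N), 4))

print("factors indicated:", number_of_factors(history, N))
```

Only quantities already produced by a centroid analysis enter the rule, so the extra computation per factor is a single division.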
Approximation error


The approximation error involved in using (14) instead of the
S.D. which would result if all the individual n(n-1)/2 residuals had
actually been computed as partials by use of (6) should, we believe,
be small. Consideration of the simplifying assumptions made in
arriving at (9) and (10), and the nature of the higher order terms
dropped in obtaining (13) should lead the reader to a similar conclu- 
sion. In order to have some basis for a better appreciation of the er- 
ror involved in using (14), we have made a comparison of the S.D.’s 
yielded by it with the actual values obtained from distributions of 
partial residuals computed by (6). The results for six checks made 
on independent sets of data are set forth in Table 1, from which it 
can be seen that the approximation error is of order .001 in three
instances and is .002, .003, and .007 in the others. This last figure represents an error of
about 138% of the actual value, which may or may not be tolerable for 
a given investigator. 
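
The error figures quoted above can be recomputed directly from the two rows of Table 1. The short Python sketch below does so; the variable names are illustrative, and the values are those printed in the table.

```python
# Rows of Table 1: actual S.D. of the partial residuals, and the S.D. by formula (14).
actual = [0.086, 0.073, 0.052, 0.047, 0.060, 0.069]
by_formula = [0.088, 0.074, 0.045, 0.046, 0.061, 0.066]

# Absolute approximation error for each of the six checks.
errors = [round(abs(f - a), 3) for a, f in zip(actual, by_formula)]

# Largest error expressed as a fraction of the actual value.
worst = max(range(len(errors)), key=errors.__getitem__)
relative = errors[worst] / actual[worst]

print(errors)
print(round(100 * relative, 1))
```

The largest discrepancy occurs in the third check (actual .052 against .045 by formula), which is where the figure of about 13% comes from.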


Empirical checks on the criterion 


An empirical check on any criterion should be based on setups 
which not only seem typical of actual experimental situations but 
which are also designed to permit the operation of certain known facts 
concerning the sampling behavior of correlation coefficients. Firstly,
it must be remembered that the sampling errors in a table of intercorrelations
are not independent. This means that one is not justified
in adding errors independently to the several r’s in a correlation matrix.
In the second place, one must not forget that the sampling errors
of correlation coefficients tend to yield skewed distributions. This
fact has not been taken into account in the published reports. Indeed, 
one investigator (5) proved by the Chi Squared technique that his 
injected errors were normally distributed, thereby unwittingly dem- 
onstrating that his empirical setup was inadequate. In the third 
place, an empirical study must, of course, involve the drawing of 
really random samples from a universe of known factorial description. 
This, it should be noted, is not the same as adding to individual co- 
efficients chance errors of predetermined variability. The sampling 
unit in this case must be an individual,* not a slip of paper containing 
a so-called chance error. 

In our empirical series, we have met these requisite conditions 
by the use of tables of random numbers. This might also be done by 


* A person or entity for which measurements are available for the variables 
being studied by the correlational-factorial method. 










the use of coins if it were not for certain practical difficulties involved 
in tossing the number of coins sufficient for the purpose. A detailed 
exposition of the manner in which the tables of random numbers were 
utilized would require too much space, but it should be noted here that 
the procedure depends upon the defensible assumption that the odd- 
or even-ness of a digit, as an element, is a chance affair, and therefore 
analogous to tossing an unbiased coin. Variables may be defined in 
terms of overlapping or common, plus specific, elements. The theoretically
expected r’s, obtained by the common element formula, provide
the universe correlation matrix of known factorial composition.
An “individual,” and his scores on the given variables, can be defined 
in terms of a column of numbers, the column being of predetermined 
width and yielding scores by an actual count of the odd digits. The 
limit to the size of a truly random sample is circumscribed only by 
the extent of available tables of random numbers. 
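
The sampling scheme just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' procedure: a seeded pseudo-random generator stands in for printed tables of random numbers, only two variables are built, the element counts (8 common, 8 specific to each variable) are hypothetical, and the common element formula is taken in its usual form for equally variable elements.

```python
import math
import random

def common_element_r(n_common, n_a, n_b):
    """Expected correlation between two variables made up of n_a and n_b
    equally variable elements, n_common of which are shared."""
    return n_common / math.sqrt(n_a * n_b)

def count_odd(digits):
    """Score contribution: the number of odd digits."""
    return sum(d % 2 for d in digits)

def draw_individual(rng, n_common, n_specific_a, n_specific_b):
    """One 'individual': two scores, each a count of odd random digits,
    sharing the digits that make up the common elements."""
    common = count_odd(rng.randrange(10) for _ in range(n_common))
    score_a = common + count_odd(rng.randrange(10) for _ in range(n_specific_a))
    score_b = common + count_odd(rng.randrange(10) for _ in range(n_specific_b))
    return score_a, score_b

def pearson_r(pairs):
    """Product-moment correlation of a list of (a, b) score pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(1942)  # assumption: pseudo-random digits replace printed tables
sample = [draw_individual(rng, 8, 8, 8) for _ in range(2000)]
print(common_element_r(8, 16, 16), round(pearson_r(sample), 3))
```

Because the sampling unit is the individual, the sampling errors of the several correlations in a larger battery built this way are automatically interdependent, which is the condition the text insists upon.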

We have tried out the criterion on six independent empirical sit- 
uations involving true samplings drawn from universes of known fac- 
torial composition. All intercorrelations were computed by the prod- 
uct-moment method. The essential facts concerning these setups as 
regards the size (N) of the samples, the number (n) of variables, 
and the known number of factors for the universe are given in Table 
2, wherein will also be found the limit ($1/\sqrt{N}$) which the given S.D.’s
for the partial residuals should reach. It will be noted from this table
that the S.D. ($\sigma_p$) for the sth partial residuals, where s equals the
known number of factors, does in all six cases tend to approximate
closely or fall below the criterion level. It will be further noted that
the reduction in the S.D. for the partial residuals which results from
extracting s + 1 factors is very small, especially when compared to
the reduction in S.D. as one goes from s - 1 to s factors.
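
The comparison made in Table 2 can be reproduced in a few lines of Python. The sketch below takes the sample sizes and the starred S.D.'s as printed in Table 2; the series labels follow the "variables-factors" pattern the text uses for the 9-1, 12-2, and 15-3 series, and the labels for the remaining three series are assumed by analogy.

```python
import math

# (N, starred S.D. at the known number of factors), as printed in Table 2.
series = {
    "9-1": (150, 0.088),
    "10-2": (208, 0.074),
    "12-2": (250, 0.061),
    "10-3": (208, 0.045),
    "15-3": (150, 0.066),
    "10-4": (150, 0.046),
}

# Criterion level 1/sqrt(N) for each series, to three decimals as in the table.
limits = {label: round(1.0 / math.sqrt(n), 3) for label, (n, _) in series.items()}

# Series whose S.D. is still above the unrounded criterion level.
above = [label for label, (n, sd) in series.items() if sd > 1.0 / math.sqrt(n)]

for label, (n, sd) in series.items():
    print(label, limits[label], sd, "above" if label in above else "at or below")
```

Two of the six series sit slightly above the limit and four fall at or below it, which is the pattern the text summarizes as tending to approximate closely or fall below the criterion level.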

We have tried out the proposed criterion on Brown and Stephen- 
son’s (2) data, for a sample of 300 and the 19 variables left after the 
purge of tests which disturbed the single factor hypothesis. These 
data, it will be recalled, satisfied the tetrad criterion. The S.D. of the
first partial residuals is .064, which is not quite down to the value of
$1/\sqrt{N}$, i.e., .058. Application of the criterion to the data of Thurstone
(6) based on 57 tests and an N of 240 would indicate the extraction
of fewer than five factors. The fact that Thurstone has reported additional
data on new samples which tend to confirm the existence of 7
or 8 factors for his battery would seem to cast considerable doubt on
the validity of our proposed criterion. It must be noted, however,
that the consideration of the effect of chance sampling errors in Thurstone’s
data is subject to a handicap: the actual number of individuals
taking the various tests ranges from 104 to 234, and we are nowhere
told just what the N’s are for the intercorrelations. Presumably they
will be still lower, and most certainly variable.


Check on other criteria 


It may not be out of order to indicate how well a couple of other 
proposed criteria for the number of factors work on our six empirical 
series. The first criterion of Tucker (6, p. 66) and later versions there- 
of (1, 7) fail in all six cases. The more recently proposed criterion of 
Coombs (3) checks in the 9-1 and 12-2 series, but for the remaining
four series, one would need to extract at least two more factors
than exist in the universe before the criterion is reached. Strangely,
for the 15-3 setup, a strict adherence to the Coombs criterion would 
not permit the extraction of any factors even though it is known that 
three sizeable factors exist in the universe from which the sample 
was drawn. 

The failure of the Tucker criterion need not be surprising; the
surprising thing would be to find that it did work, since it is solely a
function of n, the number of variables. The present writer cannot
entertain any hopes for Coombs’ proposal since it also is mainly a
function of n. Surely the size of the sample must have something to
do with the amount of variance which may be attributed to chance
sampling.

Although our proposed criterion seems to work fairly satisfac- 
torily, there are two limitations which should be mentioned. Firstly, 
we have no evidence that it will be adequate for those situations, most 
frequently encountered in practice, where the number of variables is 
larger than in our empirical series. And secondly, we must admit
that the sampling error of the residual, $r_{ij.12 \ldots s}$, as a partial correlation
may not be strictly comparable to that of the ordinary partial
since $r_{ij}$ has been utilized in determining the values of the factor loadings
which have then been used in calculating the partial residuals.
These points need further investigation.


TABLE 1

Empirical Check on Error Involved in Approximation by Formula (14) for the
S.D. of the sth Partial Residuals (Six independent sets of data)

No. variables         9     10     10     10     12     15
No. factors           1      2      3      4      2      3
Actual S.D.        .086   .073   .052   .047   .060   .069
By formula (14)    .088   .074   .045   .046   .061   .066


TABLE 2

Empirical Check on the Partial Residual Criterion

N                   150    208    250    208    150    150
No. variables         9     10     12     10     15     10
No. factors           1      2      2      3      3      4
1/√N               .082   .069   .063   .069   .082   .082
σ_p, 1 factor      .088*  .468   .364    001    043   .221
σ_p, 2 factors     .081   .074*  .061*  .198   .288   .154
σ_p, 3 factors            .066   .055   .045*  .066*  .095
σ_p, 4 factors                          .041   .061   .046*
σ_p, 5 factors                                        .048

* These values should be down to or below the values given for 1/√N.


This paper relates the constant process used in psychophysics
to the problem of item selection. Each test item may be described in 
terms of a limen, which is an index of the point at which an item 
discriminates, and the standard deviation of the limen, which is an 
index of the ‘goodness’ of discrimination. The method developed may 
be related not only to the description of items but also to the de- 
scription of persons. Thus a person’s ability may be described in 
terms of a limen and its standard deviation. 


Within recent years an increasing tendency has become apparent 
among psychometricians to bring the methods of mental measurement 
more directly in line with the psychophysical methods of experimental 
psychology. This tendency is particularly marked in existing methods 
for the scaling and standardization of tests and in some of the many 
techniques recently developed for the selection of test items. The ap- 
plication of the psychophysical methods in the scaling and selection 
of test items was foreshadowed by a number of writers: Binet (1, 
1908), Thurstone (2, 1925), Thomson (3, 1926), and Symonds (4, 
1929). Guilford (5, 1936) furnished a complete formulation of the 
problem with which this paper concerns itself, but offered no solution. 
He writes, “If one could establish a scale of difficulty in psychological
units, it would be possible to identify any test item whatsoever by giving
its median value and its ‘precision’ value in terms of h as in the
method of constant stimuli. This is an ideal towards which testers
have been working in recent years and already the various tools for 
approaching that goal are being refined.” 

The only practicable solution to this problem as formulated by 
Guilford involves the establishment of an arbitrary scale of ability on 
the assumption that ability is normally distributed in the population, 
and the description of the performance of any given person in terms
of σ-units on this arbitrary scale. This is an orthodox statistical procedure,
and is frequently used in the standardization of tests. The
procedure is valid when the age-range of persons tested is small. 
When, however, the age-range is large a development of the technique 

*I am indebted to Godfrey H. Thomson of the University of Edinburgh for
reading and criticising part of this paper, and to D. N. Lawley of the same
University for assistance in the development of certain of the arguments
contained herein.

is required since the variance of ability is not independent of age. In
the following discussion we shall confine ourselves to a single year of 
age-range, making the assumption that the variance of ability is equal 
for each month of age. 

In the estimation of a two-point tactual limen by means of an 
aesthesiometer, for example, the limen is regarded as that point, 
usually in millimetre units, where the probability of either a one- 
point or two-point judgment is one half. The scatter of the limen is 
described in terms of its variance or in terms of a “precision value,” 
h, which is in fact a weight proportional to the reciprocal of the 
standard deviation of the limen. In the application of the constant 
process* to item selection, we define the limen as that point measured
in “σ-units” of ability where the probability of a person of that ability
either passing or failing the item is one half. This limen is the
point of discrimination. The standard deviation of the limen is an in- 
dication of the “goodness” of discrimination. Having estimated these 
two parameters for each item of a test we are in a position to esti- 
mate the probability of any given person passing any given item. The 
estimation of such a probability is fundamental in the reduction of 
mental-test method to a sound theoretical basis. Within recent years
much labour has been wasted in the attempt to apply correlation methods
to problems where the more elementary theory of mathematical
probability would have produced results of substantially greater simplicity
and value.

In outlining the application of the constant process to item selec- 
tion let us presume that we have constructed a provisional test of a 
large number of items from which we wish to select those items which 
are to be included in the finished test, and that we have given this 
provisional test a preliminary tryout on a representative sample of 
the age range for which the test is ultimately intended. Let us pre- 
sume that the age range is not greater than a single year. The score 
obtained by a person on the complete test must, in the absence of any 
better criterion, be regarded as the best available estimate of a func- 
tion of that person’s ability. 

On the assumption that the distribution of ability is normal, let
us divide the persons tested into k categories, the class interval expressed
in “σ-units” of each category being the same. Thus if we were
to divide our persons into seven categories (k = 7) and to adopt .6σ
as the class-interval, the percentage falling in each category would be
as shown in Figure 1.
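
The percentages for k = 7 and a class interval of .6σ follow from the normal integral and can be checked with a short Python sketch. The function names are illustrative; the sketch assumes, as the example in the text implies, that the categories are centred on the mean and that the two end categories are open-ended.

```python
import math

def normal_cdf(z):
    """Cumulative normal probability up to z (in sigma units)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_percentages(k, width):
    """Percentage of a normal population in each of k categories of equal
    width (sigma units), centred on the mean, end categories open-ended."""
    # Interior boundaries: -1.5, -0.9, -0.3, 0.3, 0.9, 1.5 for k = 7, width = 0.6.
    bounds = [(i - k / 2.0) * width for i in range(1, k)]
    cumulative = [0.0] + [normal_cdf(b) for b in bounds] + [1.0]
    return [round(100.0 * (hi - lo), 1) for lo, hi in zip(cumulative, cumulative[1:])]

print(category_percentages(7, 0.6))
# [6.7, 11.7, 19.8, 23.6, 19.8, 11.7, 6.7]
```

Applied to a sample of a given size, these percentages fix how many persons are to be placed in each ability category, counting down from the highest total score.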

Assign to each category a value x in terms of “σ-units” equal to


*Throughout I have adhered to the practice suggested by Thomson (6) of
using the word process to imply a process of calculation.

[Fig. 1. Percentage of persons in each of seven ability categories of class interval .6σ: 6.7%, 11.7%, 19.8%, 23.6%, 19.8%, 11.7%, 6.7%. Category boundaries at -1.5σ, -.9σ, -.3σ, .3σ, .9σ, 1.5σ.]



the mid-point of that category. For all practicable purposes it will be
sufficient to regard the mid-points of the categories at the tails of the
distribution as, in the above example, -1.8σ and 1.8σ. We have now
grouped our sample of persons into categories on a scale of ability in 
which the differences between the mid-points of the several categories 
are considered equal. 

We now construct an answer-pattern, and determine the number 
and proportion of persons in each ability category passing each item. 
The proportions, $p_1, p_2, \ldots, p_k$, when plotted against the mid-points,
$x_1, x_2, \ldots, x_k$, of the corresponding ability categories expressed in
“σ-units” should, on the basis of the phi-gamma hypothesis, form the
integral of the normal curve of error; that is, if we transform the
proportion p to a new variable y by the relation

\[ p = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-u^2/2} \, du , \]

then on the basis of the phi-gamma hypothesis there exists a linear
relationship between x and y.

From the proportions $p_1, p_2, \ldots, p_k$, the value of the limen and
its standard deviation can be estimated for each item in terms of “σ-units”
of ability. These two parameters describe the functioning of




the item, the limen being a measure of the point at which the item 
discriminates and the standard deviation being an indication of the 
“goodness” of discrimination. Any one of a number of processes in 
general use among psychophysicists may be used for the estimation 
of these parameters. 

For a complete rationalization of the problem, the constant process,
sometimes termed the Müller-Urban method, should be used.
The use of this process as applied to this particular problem involves
the weighting of the observations by the combined Müller-Urban
weights and also by the number of cases upon which each value of p
is based. This process, although theoretically the most admirable, in- 
volves much arithmetical labour, and can not, therefore, be regarded 
as practicable for the routine purpose of item selection. 

The application of Spearman’s arithmetic-mean process is to a 
large degree invalidated in this connection by virtue of the tail as- 
sumptions involved. With items whose limina are near the mean of 
the ability scale, the Spearman process yields reasonable estimates 
of limina and standard deviations, since with items of this type the 
tail assumptions necessary are not great. However, with items whose 
limina are near the extreme of the ability scale, it becomes necessary 
to estimate the required parameters from a few values of p at one 
end of the distribution. With items of this type it becomes impossible 
to fix in any reasonable fashion the tails of the distribution. 

Other processes suggest themselves which are simple in type and 
yield results sufficiently satisfactory for routine purposes. The first 
is the process of simple linear interpolation in which the median or 
50% point is regarded as an estimate of the limen. This process has 
the disadvantages that it uses only two values of p, that it involves 
the assumption that the curve is a straight line between the two val- 
ues of p in question, and that it yields no measure of scatter. The 
last of these objections may be eliminated by calculating the 16% and 
84% points, and taking half the distance between these two points as 
a rough estimate of the standard deviation. With items that are very 
difficult the 84% point, and with items that are very easy the 16% 
point, can not readily be calculated. In such cases it is necessary to 
use the difference between either the 84% point or the 16% point and 
the 50% point as a very rough estimate of the standard deviation. 
While this process is open to many serious objections on theoretical 
grounds, the calculation involved is small, and the estimate obtained, 
although subject to substantial error, will be found satisfactory when 
no great accuracy is desired. 
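
The linear interpolation process can be sketched in Python as follows. The proportions are those reported for item 12 in Table 2 later in the paper; the function name and the handling of unbracketed targets are assumptions of the sketch.

```python
def interpolate_point(x, p, target):
    """x-value at which the proportion first crosses `target`, found by
    straight-line interpolation between adjacent observed proportions."""
    for (x0, p0), (x1, p1) in zip(zip(x, p), zip(x[1:], p[1:])):
        if p0 <= target <= p1 and p1 > p0:
            return x0 + (target - p0) / (p1 - p0) * (x1 - x0)
    return None  # target not bracketed by the observed proportions

# Category mid-points (sigma units) and proportions passing item 12.
x = [-1.8, -1.2, -0.6, 0.0, 0.6, 1.2, 1.8]
p_item_12 = [0.07, 0.20, 0.28, 0.52, 0.63, 0.88, 0.87]

limen = interpolate_point(x, p_item_12, 0.50)   # 50% point
lower = interpolate_point(x, p_item_12, 0.16)   # 16% point
upper = interpolate_point(x, p_item_12, 0.84)   # 84% point
sd = (upper - lower) / 2.0                      # half the 16%-84% distance

print(round(limen, 2), round(sd, 2))
# -0.05 1.24
```

These figures agree with the entries for item 12 under linear interpolation in Tables 3 and 4. When `interpolate_point` returns `None` for the 16% or 84% point, as it will for very easy or very difficult items, the one-sided estimate described in the text would have to be used instead.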

A process that avoids some of the disadvantages of linear interpolation
between the observed proportions is to determine a sigma
value y corresponding to each value of p in the normal ogive from
tables prepared for this purpose. The sigma values y when plotted
against values of x should assume a linear relationship. We can now
proceed to estimate the required parameters either by simple interpolation
between the sigma values or by calculating the slope of the
best-fitting least-squares line. This least-squares line process is in
fact the constant process without the use of weights. If the data are
arranged such that $\sum x = 0$, the slope of this line is given by the relation

\[ b = \frac{\sum xy}{\sum x^2} \]

and the limen by the relation

\[ L = -\frac{\sum y}{kb} , \]

where k is the number of categories.

The standard deviation of the limen is equal to the reciprocal of
the slope.
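
The least-squares line process can be sketched in Python with the standard library's inverse normal function in place of printed tables. The sketch is hedged in two respects: it uses the general least-squares formulas (which reduce to the relations above when the x values sum to zero), and it skips proportions of exactly 0 or 1, which have no finite sigma value; that rule is an assumption of the sketch, not something stated in the paper. The data are the proportions reported for item 12 in Table 2.

```python
from statistics import NormalDist

def fit_limen(x, p):
    """Unweighted least-squares line through (x, y), where y is the normal
    deviate ('sigma value') of p. Returns (limen, standard deviation)."""
    inverse = NormalDist().inv_cdf
    # Assumption: points with p = 0 or p = 1 are dropped (no finite deviate).
    points = [(xi, inverse(pi)) for xi, pi in zip(x, p) if 0.0 < pi < 1.0]
    n = len(points)
    mean_x = sum(xi for xi, _ in points) / n
    mean_y = sum(yi for _, yi in points) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in points)
             / sum((xi - mean_x) ** 2 for xi, _ in points))
    intercept = mean_y - slope * mean_x
    limen = -intercept / slope      # x at which y = 0, i.e. p = one half
    return limen, 1.0 / slope       # S.D. is the reciprocal of the slope

x = [-1.8, -1.2, -0.6, 0.0, 0.6, 1.2, 1.8]
p_item_12 = [0.07, 0.20, 0.28, 0.52, 0.63, 0.88, 0.87]
limen, sd = fit_limen(x, p_item_12)
print(round(limen, 2), round(sd, 2))
```

Fitted to all seven points, the standard deviation comes out near 1.3, close to the tabled value for this item. Exact agreement with the printed tables should not be expected, since the text does not say which points were retained for each item.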

To illustrate the functioning of item selection by the constant 
process and to determine the relative efficacy of the various processes 
suggested for estimating the limen and standard deviation, the fol- 
lowing short experiment was conducted. From the test scripts of a 
complete year group of 11+ children, who had taken a Moray House 
Test, a sample of 216 scripts chosen at random was selected. The 216 
children whose scripts were selected were found to be representative 
of the complete year group of 11+ children. These 216 scripts were 
then divided into seven categories of equal class interval in terms of
“σ-units.” The following table shows the number and proportion of
persons in each category.

The mid-points of the categories at the tails of the distribution 











TABLE 1

Mid-point of
class interval     % in        No. in
x (σ-units)        category    each category
    1.8              6.7          15
    1.2             11.7          25
     .6             19.8          43
      0             23.6          50
    -.6             19.8          43
   -1.2             11.7          25
   -1.8              6.7          15


TABLE 2

x (σ-units)   -1.8   -1.2    -.6      0     .6    1.2    1.8
Item                    Proportions passing item
  1            .33    .56      1    .94    .95    .96    .98
  2            .13    .36    .65    .82    .91   1.00   1.00
  6            .07    -00    .30    .44     Ef    .88   1.00
 12            .07    .20    .28    .52    .63    .88    .87
 20            .13    B [°- Az      .28    .47    .72    .87
 28            .00    .08    .05    .18    .20    .36    .47





are taken as 1.8σ and -1.8σ. These values are of course not the mid-points,
but the error introduced by their adoption is negligible.

An answer pattern was then constructed and the number and 
proportion of persons in each category passing each item were deter- 
mined. 

For purposes of illustration, Table 2 gives the proportion of 
persons in each category for six different items. 

From these proportions we may proceed to estimate a limen and 
a standard deviation for each item. For comparative purposes these 
parameters have been estimated for the foregoing six items in a num- 
ber of ways. 

Firstly, the constant process,* or Müller-Urban method, was used.
Each value of p was weighted by the Müller-Urban weights and by
the number of observations upon which it was based. This amounts
to weighting each proportion p by the product of the Müller weight
and the quantity $n/pq$, the reciprocal of the variance of each proportion.
The values of the limina calculated by this process are given in
Table 3 and the corresponding standard deviations in Table 4. Fig- 
ures 2 to 7 show the obtained values of p plotted against values of x 
for the six items with the best fitting normal ogives. 

The limina and standard deviations were calculated also by Spear- 
man’s arithmetic-mean process, by simple linear interpolation, and by 
fitting a least-squares line to the sigma values y corresponding to 
values of p in the normal ogive (the constant process without 
weights). The values of the limina calculated by these three pro- 


* Although the term constant process is correct, in the strictest conventional
sense, only when limina and precision values are estimated, I employ the term to
relate also to the process whereby standard deviations are estimated. The standard
deviation is a simple function of the precision value, the relation being

\[ s^2 = \frac{1}{2h^2} . \]

Throughout I have calculated values of s instead of values of h.

[Figure: Test Item 12]

TABLE 3

Values of Limen Obtained by Different Processes

                                      Test item
Process                     1       2       6      12      20      28
Constant Process        -1.48    -.88    -.08     .06     .64    1.79
Arithmetic Mean          ....    -.82     .03     .07      ee     fee
Linear Interpolation    -1.36    -.91     .11    -.05     .67    1.96
Least-squares line
  process               -1.58    -.82     .03    -.04     .51    2.12





cesses are given in Table 3, and the values of the corresponding stand- 
ard deviations in Table 4. 


TABLE 4

Values of Standard Deviations Obtained by Different Processes

                                      Test item
Process                     1       2       6      12      20      28
Constant Process         1.44    1.00    1.05    1.31    1.40    1.88
Arithmetic Mean          ....     .86     .89    1.22     [re
Linear Interpolation     1.11     .93    1.16    1.24    1.29    2.06
Least-squares line
  process                1.66     .93     .93    1.31    1.48    1.87





Spearman’s arithmetic-mean process is invalidated for items of types 
1 and 28 by virtue of the tail assumptions involved. If the constant 
process can be taken as furnishing the best estimates of the required 
parameters, Spearman’s process, as is to be expected, furnishes sub- 
stantial underestimates of the standard deviations. 

The estimates obtained by fitting a least-squares line to the sigma
values y corresponding to values of p in the normal ogive differ somewhat
from the estimates obtained by the constant process. These differences
result from the absence of weights. Values obtained without
the use of weights must be regarded as rough estimates only. In applying
the least-squares line process, it will be found desirable to delete
certain of the extreme values of y which are based on comparatively
few cases and to fit the line to the four or five central points.

The estimates obtained by linear interpolation seem to approximate
about as closely to the estimates obtained by the constant process
as those obtained by the least-squares line process. When the
data neither warrant nor demand great accuracy, the results obtained
by linear interpolation will be sufficiently approximate for routine
purposes. If substantial accuracy is desired, a sample much greater
than 216 cases is necessary.

The application of the method described above relates mental-testing
technique directly to the psychophysical methods. From estimates
of the limen and standard deviation it is possible to estimate
the probability of any given person passing any given item. For example,
the limen and standard deviation for item 20 estimated by the
constant process are respectively .64σ and 1.40σ. This limen implies
that the probability of a person of ability .64σ either passing or failing
the item is ½. The probability of a person of ability 2.04σ passing
this item is .84, while the probability of a person of ability -.76σ
passing the item is .16. I am of the opinion that the estimation of such
probabilities will have further application in mental-test theory.
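
The worked example for item 20 can be reproduced with the normal integral. In the Python sketch below the function name is illustrative; the limen (.64σ) and standard deviation (1.40σ) are the constant-process estimates quoted in the text.

```python
from statistics import NormalDist

def prob_pass(ability, limen, sd):
    """Probability that a person of the given ability (sigma units) passes an
    item with the given limen and standard deviation, under the normal ogive."""
    return NormalDist(mu=limen, sigma=sd).cdf(ability)

# Item 20: limen .64, standard deviation 1.40 (constant process).
print(round(prob_pass(0.64, 0.64, 1.40), 2))   # 0.5
print(round(prob_pass(2.04, 0.64, 1.40), 2))   # 0.84
print(round(prob_pass(-0.76, 0.64, 1.40), 2))  # 0.16
```

The abilities 2.04σ and -.76σ lie exactly one standard deviation above and below the limen, which is why the probabilities are the familiar .84 and .16.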

Just as we describe an item in terms of a limen and standard
deviation, it is possible to reverse the process and, on the assumption
that the items on a given test are representative of a defined population
of items, to describe each person in terms of a limen and a standard
deviation. Thus the ability of a person could be described in terms
of a level of difficulty where the probability was ½ that a person would
either pass or fail tests of that difficulty.

Such a limen would be closely analogous to a sensory threshold.
Furthermore, we could describe the relative abilities of a number of 
persons in terms of the probabilities of their passing a task at an ar- 
bitrarily specified level of difficulty. 

The technique of constructing a test whereby persons could be described
in terms of limina and standard deviations or precision values
would seem at the moment to be something as follows. As previously
we divided persons tested into k categories with equal class intervals
expressed in “σ-units,” so now a test could be constructed of k subtests
of increasing order of difficulty. Our subtests would be selected
such that the differences in difficulty between them would be in terms
of equal “σ-units” of difficulty, relative, of course, to some defined
population. One would then proceed to calculate for any given person
the proportion of his successful responses in each subtest, ascending
as they do in equal “σ-units” of difficulty, and assume as before that
the proportions, $p_1, p_2, \ldots, p_k$, when plotted against the mid-points,
$x_1, x_2, \ldots, x_k$, of corresponding difficulty categories, form the integral
of the normal curve of error.

The next step would be to estimate for each person a limen and 
a standard deviation or precision value by any of the methods de- 
scribed above or for that matter by any efficient method of estimation. 
The limen for persons would be in terms of “σ-units” of difficulty.
Thus, ability is described in terms of a parameter the implication of
which is that the probabilities are equal that any given person will
either pass or fail tasks of a certain defined difficulty.
In such a procedure we may well enquire what would be the
meaning of our second parameter for persons, the standard deviation
or precision value. It is apparent that this second parameter is an index
of the degree to which the responses of any given person differ
from the response of the average person in a defined population. Thus,
what person a finds difficult may differ somewhat from what person b
finds difficult and from what is regarded as difficult by the average
person in a given population. Thus, although the average ability, as
it were, of a given person over a series of tasks may be about the 
same as that of the average person, he may find some tasks easier and 
some more difficult than the average person; that is, difficulty for any 
given person is not the same as difficulty for the average person. 
Hence our second parameter is a measure of a difference between any 
given person and the average person, and as such is a type parameter. 
Indeed it bears a kinship to the second or bipolar factor usually found
in the factorial analysis of persons.


Allport’s J-curve hypothesis of conforming behavior and its
attendant treatment of appropriate data are criticized on the follow- 
ing points: (1) narrowness of application; (2) flexibility of inter- 
pretation of results; (3) arbitrary selection of a criterion of con- 
formity; (4) lack of a means by which the extent of conformity in 
one situation can be compared with that in another; (5) inequality 
of “telic” units. As an alternative treatment of such data, the meth- 
od of higher moments is suggested and rationalized. Data from All- 
port and Solomon are reworked by this method and results compared. 


F. H. Allport (2) has indicated the importance of conformity to 
sociology and social psychology. We agree that there is a need for 
more quantitative studies in this field and for a methodology. The 
validity of the “J-curve hypothesis” in this respect is, however, open 
to question. It is the purpose of this paper to investigate some of the 
shortcomings of this hypothesis and to suggest an alternative method 
based on established statistical procedures. 


The “J-Curve Hypothesis” 


Fields of Conformity. Basic to the “J-Curve hypothesis” is what 
Allport calls a “conformity situation” or “field of conformity” which 
he has defined (2, p. 912) as follows: “A conformity field exists when 
there is a generally accepted, though not necessarily explicitly stated, 
rule and purpose in the situation, and when fifty per cent* or more of 
the population fall upon the first step of a telic (conformity) con- 
tinuum whose variable is degrees of fulfillment of this purpose.” 

Without such a “field of conformity,” the “J-curve hypothesis” 
is not applicable. Thus this approach is limited to cases where (1) 
there is a discoverable, though “not necessarily explicitly stated” rule 
or custom; (2) “complete conformity” has been defined and (3) a 
majority of the population “conform completely.” 

If we assume that the “J-curve hypothesis” can be used to meas- 
ure conformity where it exists, it follows that no conformity of be- 
havior obtains unless the foregoing conditions hold. If on the other 
hand we hold that there may be conformity of behavior even in the 
event that we are not immediately able to define a specific rule or cus- 


* Italics mine. 












tom or where somewhat less than 50% of the population conform 
fully, it immediately follows that even if valid, the “J-curve hypothe- 
sis” has at best a very limited application. A more widely applicable 
technique for the study of conformity is needed. 

Having defined the “field of conformity,” Allport continues (2, p. 
913): 


If, in any field of conformity (see definition given 
above) we apply a scale whose steps are variations of beha- 
vior which represent successive recognizable degrees of ful- 
fillment of the “accepted common purpose,” ranging from the 
prescribed or “proper” act, which most completely fulfills the 
purpose (on the left) to that which gives it the least rec- 
ognizable amount of fulfillment (upon the right), the fol- 
lowing will occur: (a) more instances will fall upon the step 
at the extreme left than upon any other; (b) the successive 
steps from left to right will have a respectively diminishing 
number of instances; and (c) the decline in the number of 
instances will decrease as we proceed by successive steps 
from left to right. 


A second portion of the hypothesis dealing with the empirical 
continuum is stated as follows (2, p. 916-917): 


In any conformity field the distribution of measurable 
variations of the behavior upon a relevant empirical, or non- 
telic, continuum is in the form of a........ unimodal, 
double-J-curve (i.e., a curve having positive acceleration of 
both slopes), in which the mode is likely to be off center and 
the slopes are likely to be asymmetrical. 


We shall consider each of these parts of the statement of the hypothesis
separately and in reverse order.

The “Double-J” Curve. In the earlier descriptions of the form of the
empirical distribution of conformity data, Allport (1) stressed three 
points. First, the curve was said to be steep or leptokurtic, second, 
positively accelerated toward its single mode from both directions, 
and third, “likely” to be asymmetrical. 

Dudycha (5) pointed out that in applying statistical measures of
kurtosis to his own as well as to Allport’s data, some of the
distributions were found to be leptokurtic, but others more mesokurtic
and some even platykurtic. In reply to this, Dickens and Solomon (4)
claim that Allport had misused the term “leptokurtic” and that “the
distinction between a normal distribution and a ‘double J’ is based
chiefly on the criterion of acceleration as you approach the mode.”
From this it appears that the “double-J” may be steep or flat,
symmetrical or skewed, so long as it is positively accelerated toward
the mode from either direction.

The normal curve is positively accelerated toward the mean (or
mode) from both directions up to the points one standard deviation on
either side of the mean. If, in studying any particular distribution,
an investigator happened to select a rather gross step interval, a
perfectly normal distribution would take on the most distinguishing
characteristic of the “double-J.” To verify this possibility we have
taken a distribution of the height of 1079 men as presented by Pearson
and Lee (7). A chi-squared test of goodness of fit indicated that any
deviation of this distribution from normal must be attributed to
chance factors. When these data are distributed with the use of a
three-inch step interval, the slopes of the resulting curve are
positively accelerated toward the mode throughout the range (Fig. 1).
Thus can perfectly normal data yield a perfect “double-J” curve.

FIGURE 1
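
The effect of a gross step interval can be checked numerically. The following is only a minimal sketch, not the computation made on the Pearson and Lee data: it assumes an ideal normal curve measured in standard-deviation units, with the modal step centred on the mean, and the function names and the two step widths (2.0 and 0.5 standard deviations) are illustrative assumptions.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative probability of the standard normal curve at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def binned_side(step, n_bins):
    """Probabilities of n_bins steps of width `step` (SD units), ordered
    from the tail toward the modal step, which is centred on the mean."""
    probs = []
    for k in range(n_bins - 1, -1, -1):   # k = 0 is the modal step
        upper = step / 2 - k * step
        lower = upper - step
        probs.append(normal_cdf(upper) - normal_cdf(lower))
    return probs


def positively_accelerated(probs):
    """True when each successive rise toward the mode exceeds the one before."""
    rises = [b - a for a, b in zip(probs, probs[1:])]
    return all(later > earlier for earlier, later in zip(rises, rises[1:]))


coarse = binned_side(step=2.0, n_bins=3)   # assumed gross interval
fine = binned_side(step=0.5, n_bins=6)     # assumed fine interval
print(positively_accelerated(coarse), positively_accelerated(fine))
```

Under these assumed settings the gross grouping is positively accelerated all the way to the mode, while the fine grouping is not, because the fine steps expose the region within one standard deviation of the mean.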

The “telic continuum.” In constructing a “telic continuum” it is first
necessary to define the purpose being fulfilled. The next step would be
to determine what is to be considered “full conformity.” One question
that immediately arises in this respect concerns the extent to which
behavior reflects the intent of the individual. It is impossible to say
because one is late to work that he did not intend to be on time.

Further, it must be pointed out that in many cases the definition
of “full conformity” may be highly arbitrary. A “J-curve” will result
from almost any distribution if the proper left-hand step is chosen.
Thus if we refer to the data on the height of 1079 men (see above) and
say that people should not be more than 69 inches tall, we may
construct a “telic continuum.” This continuum is presented in Figure 2.
The original distribution of the data on which it is based, it will be
recalled, is normal.
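
A sketch of how an arbitrary left-hand step manufactures a “J-curve” from normal data is given below. It assumes an ideal normal curve in standard-deviation units; the cut-off of 0.5 SD above the mean and the step width of 0.5 SD are illustrative assumptions and are not the values behind Figure 2. The function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative probability of the standard normal curve at z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def telic_steps(cutoff, step, n_steps):
    """Share of cases on each step of an arbitrary 'telic continuum'.

    Step 0 ('full conformity') holds everything at or below `cutoff`;
    each later step holds the next `step` SD units above it.
    """
    shares = [normal_cdf(cutoff)]
    for k in range(1, n_steps):
        lower = cutoff + (k - 1) * step
        shares.append(normal_cdf(lower + step) - normal_cdf(lower))
    return shares


def satisfies_j_curve(shares):
    """Conditions (a)-(c) of the hypothesis: the first step is modal, the
    frequencies diminish from left to right, and the declines shrink."""
    drops = [a - b for a, b in zip(shares, shares[1:])]
    diminishing = all(d > 0 for d in drops)
    decelerating = all(x > y for x, y in zip(drops, drops[1:]))
    return diminishing and decelerating


shares = telic_steps(cutoff=0.5, step=0.5, n_steps=6)
print([round(s, 3) for s in shares], satisfies_j_curve(shares))
```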

“Telic” Units.

If the “telic continuum” is to be of any use at all, units on it
should be at least psychologically if not physically equal. In early
studies such as that of conformity of the response of drivers to a stop
sign at an intersection, little attempt was made to equate the
intervals. In this study (Allport, 1) the units employed were (a) come
to a complete stop, (b) slow down considerably, (c) slow down slightly,
and (d) go ahead with no reduction of speed. If we consider a driver
who had been driving at 50 miles an hour and slowed down to 25, another
who had been going 30 and slowed to 25, and a third who had been going
25 and did not reduce his speed at all, we have three individuals all
crossing the intersection at the same speed. From one point of view,
each of these drivers is failing to conform to the same extent. Yet
according to Allport’s criteria, each would fall on a different step of
the continuum. By following Allport’s logic in this case, if all three
drivers had come to a complete stop, we would have to consider the
first as conforming more completely than the other two, since his
decrease of speed would have been greatest. By the same token, the
second must in stopping have conformed to a greater extent than the
third. Thus complete conformity would require not only that the driver
stop, but that he first attain the greatest possible speed. This
reductio ad absurdum can be applied equally well to other measures of
“complete conformity.”

In the study of Allport and Solomon (3) on lengths of con- 
versation in church, library, and club-room situations, an attempt 
was made to obtain equal “telic” units. To obtain an evaluation of 
various degrees of annoyance due to conversation of other people in 
such situations, five statements were selected for each of the three 
situations. These statements for the library situation were: 


1. I am slightly distracted from what I am doing.

2. I am beginning to notice the conversation.

3. I feel like stopping my activities and staring or actually do so.

4. I feel like getting up and asking them to stop or actually do so.

5. I feel like asking an authority to make them stop or actually do so.


The extent of annoyance expressed by each of these feelings or ac- 
tions was rated by a modification of the Thurstone attitude technique, 
and an annoyance value was assigned to each. 

Next, a scale was prepared for each of the situations. On each
scale each of the five statements appeared. Below each was a line
marked off in 15-second intervals. The subjects were asked to indicate
by checking on the appropriate line how much time would elapse before
each thought he would experience the state of annoyance expressed by
the statement.

A graph was then constructed for each of the situations. Along 
the abscissa were plotted the median times at which the individuals 
tested reported that they would experience the state described by each 
of the five statements rated. Along the ordinate, the annoyance values
of each of the five statements were plotted. Horizontal lines were
drawn from each of the five points on the ordinate to where each met 
a vertical line through the corresponding (i.e., from the “time value” 
of the same statement) point on the abscissa. By drawing a line 
through the points at which the five sets of lines intersected, a con- 
tinuous and infinitely divisible unit of time with respect to equal 
units of annoyance value was graphically determined. 
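
The graphical construction amounts to linear interpolation between the five plotted points. A minimal sketch follows. The median times are the Allport and Solomon library values listed for degrees A to E in Table 1, but the annoyance scale values are hypothetical placeholders, since they are not reported here, and `time_at_annoyance` is an illustrative name rather than anything used by the original authors.

```python
def time_at_annoyance(target, annoyance, times):
    """Linearly interpolate the time at which a given annoyance value is reached.

    `annoyance` must be increasing; `times` are the matching median times.
    """
    points = list(zip(annoyance, times))
    for (a0, t0), (a1, t1) in zip(points, points[1:]):
        if a0 <= target <= a1:
            return t0 + (t1 - t0) * (target - a0) / (a1 - a0)
    raise ValueError("target lies outside the scaled range")


# Median times in minutes for degrees A-E (Allport & Solomon column, Table 1).
median_times = [1.25, 1.58, 3.15, 5.75, 7.90]
# HYPOTHETICAL annoyance scale values, for illustration only.
annoyance_values = [1.0, 2.0, 4.0, 7.0, 9.0]

# Times that correspond to equal annoyance steps of one unit.
units = [time_at_annoyance(a, annoyance_values, median_times) for a in range(1, 10)]
print([round(u, 2) for u in units])
```

Because every interpolated time is read off the recorded medians, any coarseness in the time scale offered to the subjects passes directly into the resulting units.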

This aspect of the experiment was duplicated with respect to the
library situation in the psychology classes at Northwestern
University.* In addition to the 15-second interval used by Allport and
Solomon, intervals of 60 seconds, 5 seconds, and one second were
employed. If the results obtained by Allport and Solomon were
independent of the interval employed, any change in the interval would
not influence the size of the resulting units. Table 1 gives a
comparison of the present results with those obtained by Allport and
Solomon. Where the intervals were both 15 seconds, the results were
comparable. Where different time intervals were used, the results were
entirely different. Table 2 shows the “infinitely divisible telic
units” as based on three different time intervals. Obviously the units
arrived at by Allport and Solomon reflected primarily the time interval
employed by them. Since the results are a function of the scale
employed rather than the general situation, they are of questionable
significance.

* These data were originally presented before the 1940 meetings of the
Midwestern Psychological Association.


TABLE 1

Comparison of the Appearance of Different Degrees of Annoyance as a Function
of the Time Interval Employed. All Data in Library Situation.
(All Measurements are in Minutes)

Degree of    Allport &
Annoyance    Solomon                  Northwestern Data
             1/4 min.     1/4 min.    1 min.    1/12 min.    1/60 min.

A              1.25         2.00       4.63       1.00         .25
B              1.58         2.00       8.67        .51         .20
C              3.15         2.50       4.34       1.00         .33
D              5.75         4.66      14.48       3.00         .50
E              7.90         7.53      25.00       5.00        8.00
N               56           40        118         69         104


TABLE 2

Comparison of Equally Divisible Telic Units as a Function
of the Time Interval Employed
(All Measurements are in Minutes)

Telic Step    Allport & Solomon        Northwestern Data
                 1/4 minute        1/12 minute    one minute

Complete
Conform.           1.5                .20            3.75
1                  1.9                .225           4.00
2                  2.4                .250           4.25
3                  2.8                .275           4.50
4                  3.3                .300           6.00
5                  3.8                .350           7.87
6                  4.3                .375           9.87
7                  4.85               .400          11.87
8                  5.40               .425          17.50
9                  6.15              1.32           22.00
10                 6.15              2.65

Having considered certain inadequacies of the J-curve hypothesis
in its present form, it is a further purpose of this paper to
re-investigate the phenomena of conformity both from theoretical and
empirical points of view in order to determine a more adequate method
of treating the data.

Quantitative Comparisons of Conformity. One of the greatest weak- 
nesses of the “J-curve hypothesis” lies in the fact that it provides no 
means by which the extent of conformity can be given quantitative 
expression. A chi squared test may be made to determine whether 
the empirical observations differ significantly from a normal distri- 
bution. No similar technique, however, exists for the testing of the 
significance of a single “J-curve” or for determining whether greater 
conformity exists on one “J-curve” than on another. 

Both Dudycha (5) and Solomon (8) have attempted to give a 
mathematically expressed index of conformity. Since, however, neith- 
er has produced the distribution function for his formula and its mo- 
ments, the expressions are rendered useless by virtue of the fact that 
they provide no means by which to determine whether an obtained 
difference between two “J-curves” could have arisen by chance. 

Any measure of conformity, if it is to be at all useful in the
study of social psychology, must be able to indicate whether or not
in observed data conformity to the extent observed could be attributed
to chance, and whether differences between conformity in two
independent situations could have so arisen. It is our hope to describe
a method of measuring conformity which is free from the artifacts of
the “J-curve hypothesis” and at the same time exercises the
above-mentioned controls.


A Statistical Theory for Measuring Conformity 


Basic Tenets. It is generally acknowledged that the final appearance 
of a bit of behavior is the result of the interplay of a great number 
of factors. Some of these tend to promote the appearance of the be- 
havior while others tend toward its inhibition. Some of the factors 
are strong and some are weak. We may never know all of the factors 
that finally result in a given bit of behavior, but without some such 
theory of causality the entire world of science as we know it today 
would collapse. 

If we were to label each of the many factors which work to bring
about the appearance of any bit of behavior under consideration as
$p_j$, where $j = 1, 2, 3, \ldots, k$; and if we were to label all the
inhibitory factors as $q_u$, where $u = 1, 2, 3, \ldots, v$, then by
setting the total sum of these factors equal to unity, we would have
the relative weight of each; thus,

$$\sum_{u=1}^{v} q_u + \sum_{j=1}^{k} p_j = 1. \tag{1}$$

The first term on the left-hand side of equation (1) gives us the
relative weight of all the inhibitory factors while the second term
gives us the relative weight of the facilitory factors. If we call these
$q$ and $p$ respectively, then (1) reduces to

$$q + p = 1. \tag{2}$$

Thus $q$ comes to represent the probability that the given behavior
will fail to appear in any given instance and $p$ the probability of
its appearance. By raising this expression to the $s$th power, thus,

$$(q + p)^s = 1, \tag{3}$$

we achieve the probability distribution of the occurrence of the
behavior.

When $s$ is large and the factors favoring the appearance of the
behavior are equal to those opposed to it, the binomial expansion
above approaches a normal curve; thus,

$$\lim_{s \to \infty} (q + p)^s = (2\pi)^{-1/2}\, e^{-\frac{(x - \bar{x})^2}{2}}\, dx, \tag{4}$$

where

$$q = p = \tfrac{1}{2}.$$
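
The approach of the binomial terms to the normal curve can be illustrated numerically. The sketch below is an illustration under assumed settings ($s = 100$, $p = q = \tfrac{1}{2}$); the helper names are hypothetical, and the normal ordinates are taken with mean $sp$ and standard deviation $\sqrt{spq}$ rather than in standard measure.

```python
from math import comb, exp, pi, sqrt


def binomial_terms(s, p):
    """Successive terms of the expansion of (q + p)**s, with q = 1 - p."""
    q = 1.0 - p
    return [comb(s, r) * p**r * q**(s - r) for r in range(s + 1)]


def normal_ordinate(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    return exp(-((x - mean) ** 2) / (2.0 * sd**2)) / (sd * sqrt(2.0 * pi))


s, p = 100, 0.5                      # assumed illustrative values
terms = binomial_terms(s, p)
mean, sd = s * p, sqrt(s * p * (1.0 - p))

# Largest gap between a binomial term and the matching normal ordinate.
largest_gap = max(abs(t - normal_ordinate(r, mean, sd)) for r, t in enumerate(terms))
print(sum(terms), largest_gap)
```

The terms sum to unity, as equation (3) requires, and with these settings no term differs from the corresponding normal ordinate by as much as .001.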













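The moments of the binomial in terms of those of the normal
curve are

$$\bar{x} = sp; \tag{5}$$

$$\sigma = \sqrt{spq}; \tag{6}$$

$$\alpha_3 = \frac{q - p}{\sqrt{spq}}; \tag{7}$$

$$\alpha_4 = \frac{1}{spq} - \frac{6}{s} + 3. \tag{8}$$

Every psychologist is familiar with the first two moments, the
mean ($\bar{x}$) and the standard deviation ($\sigma$). $\alpha_3$ and
$\alpha_4$ are the third and fourth moments about the mean of a
distribution of standard measures. They may be computed from obtained
data as follows:

$$\alpha_3 = \frac{1}{N} \sum \left( \frac{x - \bar{x}}{\sigma} \right)^3 \tag{9}$$

and

$$\alpha_4 = \frac{1}{N} \sum \left( \frac{x - \bar{x}}{\sigma} \right)^4. \tag{10}$$

For the normal curve the values of these parameters are

$$\bar{x} = 0; \tag{11}$$

$$\sigma = 1; \tag{12}$$

$$\alpha_3 = 0; \tag{13}$$

$$\alpha_4 = 3. \tag{14}$$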
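
As a check on formulas (7) to (10), the moments can be computed directly from the terms of a binomial and compared with the closed forms. This is a sketch under assumed values ($s = 20$, $p = .8$); `alpha_moments` and `binomial_alphas` are hypothetical helper names, and the frequencies are supplied as weights instead of as $N$ separate observations.

```python
from math import comb, sqrt


def alpha_moments(values, weights):
    """Third and fourth moments about the mean in standard measure,
    following equations (9) and (10), with frequencies given as weights."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    sd = sqrt(var)
    a3 = sum(w * ((v - mean) / sd) ** 3 for v, w in zip(values, weights)) / total
    a4 = sum(w * ((v - mean) / sd) ** 4 for v, w in zip(values, weights)) / total
    return a3, a4


def binomial_alphas(s, p):
    """Closed forms of equations (7) and (8)."""
    q = 1.0 - p
    return (q - p) / sqrt(s * p * q), 1.0 / (s * p * q) - 6.0 / s + 3.0


s, p = 20, 0.8                       # assumed illustrative values
pmf = [comb(s, r) * p**r * (1.0 - p) ** (s - r) for r in range(s + 1)]
a3_num, a4_num = alpha_moments(range(s + 1), pmf)
a3_formula, a4_formula = binomial_alphas(s, p)
print(a3_num, a3_formula, a4_num, a4_formula)
```

With $p > q$ the third moment is negative under the sign convention of equation (7), and the fourth moment lies slightly above 3.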


Application of $\alpha_3$ to conformity.

The assumption underlying the application of the foregoing binomial
to data is that the factors making for the occurrence of the behavior
under consideration are equal in potency to those which would make for
the failure of the behavior to occur. In a situation where social or
other forces operate strongly to bring about a certain type of
behavior, the equality of $p$ and $q$ no longer exists and $p > q$.
Where the forces operate to inhibit a mode of behavior, the opposite
relationship obtains and $p < q$.

In cases where $p = q = \tfrac{1}{2}$ the curve is symmetrical and
$\alpha_3 = 0$. When $p \neq q$ the symmetry no longer exists and
$\alpha_3 \neq 0$.

From the formula

$$\alpha_3 = \frac{q - p}{\sqrt{spq}}$$

it is clear that the size of $\alpha_3$ will vary with the difference
between $p$ and $q$ when $s$ is constant. Thus to test whether or not
factors are at work to bring about conformity in a given social
situation would require testing the null hypothesis that the $\alpha_3$
of the distribution does not differ from zero. If this hypothesis is
rejected* it may be stated, without recourse to any of the artifacts of
the “J-curve hypothesis,” that factors are operating which tend to
force conformity. Further, it is possible by the use of this
statistical treatment to determine whether the conformity in one
situation is significantly greater than that in another. This may be
accomplished by taking the ratio of the difference of the $\alpha_3$’s
to the standard error of that difference. If we call this ratio “t,”
the formula for its determination is

$$t = \left( \alpha_{3_1} - \alpha_{3_2} \right) \left( \sigma^2_{\alpha_{3_1}} + \sigma^2_{\alpha_{3_2}} \right)^{-1/2}, \tag{15}$$

where

$$\sigma^2_{\alpha_3} = \frac{6N(N-1)}{(N-2)(N+1)(N+3)}. \tag{16}$$
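
Equations (15) and (16) translate directly into a short computation. The sketch below is illustrative only; the function names are hypothetical, and the numerical inputs are the church and library values as printed in Table 3.

```python
from math import sqrt


def se_alpha3(n):
    """Standard error of alpha_3: square root of the variance in equation (16)."""
    return sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def t_ratio(a3_first, se_first, a3_second, se_second):
    """Equation (15): difference of two alpha_3 values over its standard error."""
    return (a3_first - a3_second) / sqrt(se_first**2 + se_second**2)


# Church (N = 200) and library (N = 802), values as printed in Table 3.
church_se = se_alpha3(200)
t_church_library = t_ratio(2.50, 0.1720, 1.96, 0.0860)
print(round(church_se, 4), round(t_church_library, 2))
```

Equation (16) with N = 200 agrees with the tabled standard error for the church situation to within rounding, and the ratio agrees with the church-library entry of Table 4.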


Conformity vs. Uniformity. 

Allport and Solomon (3, p. 420) point out the fault of failing to
distinguish between conformity and simple uniformity. They do not,
however, give any statistical measure with which to determine this
difference. They do assume that uniformity is greatest in the church
situation by virtue of the fact that 27% of the cases fall upon the
modal step of their empirical continuum.

Uniformity of behavior (where there is an equal probability for
the phenomenon to either occur or fail to occur) is indicated by the
kurtosis. To return to our analogy in the binomial expansion, it is the
case where, in the formula $(q + p)^s$, $s$ is small and the
possibilities of response limited. From formula (8) it follows that as
$s \to \infty$ the first two expressions on the right-hand side of the
equation approach zero and $\alpha_4$ approaches 3.

When $\alpha_4$ of a distribution is significantly† greater than 3 we
may say that the response in a given situation is more uniform than
could be explained by chance.
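
The corresponding test for $\alpha_4$ can be sketched in the same way. The variance used is the large-sample expression attributed to Fisher (6); its numerator, $24N(N-1)^2$, is the standard form and is an assumption here, since it should be checked against that source. The function names are hypothetical, and the inputs are the library values as printed in Table 3.

```python
from math import sqrt


def se_alpha4(n):
    """Standard error of alpha_4 from Fisher's variance (assumed standard form)."""
    variance = 24.0 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5))
    return sqrt(variance)


def excess_ratio(alpha4, n):
    """Number of standard errors by which alpha_4 exceeds the normal value of 3."""
    return (alpha4 - 3.0) / se_alpha4(n)


# Library situation as printed in Table 3: N = 802, alpha_4 = 8.94.
library_se = se_alpha4(802)
library_ratio = excess_ratio(8.94, 802)
print(round(library_se, 4), round(library_ratio, 1))
```

A ratio of 2.58 or more would be taken as significant under the large-sample criterion used for $\alpha_3$; the library value lies far beyond it.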


* The variance of $\alpha_3$ is given by Fisher (6, p. 79) as

$$\sigma^2_{\alpha_3} = \frac{6N(N-1)}{(N-2)(N+1)(N+3)}.$$

To be significantly different from zero, $\alpha_3$ must (in the case
of large samples) be 2.58 or more times greater than the square root of
this variance, that is, its standard error.

† $\sigma^2$ of $\alpha_4$ is given by Fisher (6) as

$$\sigma^2_{\alpha_4} = \frac{24N(N-1)^2}{(N-3)(N-2)(N+3)(N+5)}.$$


The Application of the Method of Moments

To compare the methods outlined above with those of the “J-curve
hypothesis,” the data presented by Allport and Solomon (3) on length
of conversation in church, library, and club-room have been
re-analyzed. The results are presented in Tables 3 and 4.

The method of moments reveals that both conformity and uni- 
formity beyond that which might be attributed to chance factors alone 
exist in all three situations (See columns 2 and 4, Table 3). This fails 
to confirm the finding of Allport and Solomon to the effect that con- 
formity did not exist in the club-room situation. Table 4 indicates that 
any difference in skewness between the clubroom and library situation 
can be explained on the basis of chance alone. It is impossible to 
maintain, as Allport and Solomon do, that conformity exists in one 
and not the other of these situations. 


TABLE 3

A Comparison of $\alpha_3$ and $\alpha_4$ and Their Standard Errors in the
Three Situations Presented by Allport and Solomon

Situation     N     $\alpha_3$    $\sigma_{\alpha_3}$    $\alpha_4$    $\sigma_{\alpha_4}$

Church       200      2.50           .1720                11.15          .3417
Library      802      1.96           .0860                 8.94          .1726
Clubroom     400      1.78           .1183                 8.15          .2420


TABLE 4

The Differences and the Significance of the Differences of
$\alpha_3$ and $\alpha_4$ of Allport and Solomon Data

Situation           Diff. $\alpha_3$   $\sigma_{\text{diff.}}$    t      Diff. $\alpha_4$   $\sigma_{\text{diff.}}$    t

Church-Library           .54              .1923           2.81        2.21              .3827           >8
Church-Clubroom          .71              .2088           3.40        3.00              .4192           >8
Library-Clubroom         .18              .1463           1.23         .79              .2980          2.65


Summary and Conclusions


The “J-curve hypothesis” of conforming behavior was examined
and found to present three main weaknesses: (1) the necessarily
post hoc definition of conformity in any given situation; (2) the
failure of the advocates of the hypothesis to secure equal “telic”
units; and (3) the failure of the hypothesis to provide any means of
comparing the difference in conformity between two situations. A
mathematical basis for the measurement of conformity was briefly
outlined and the method of moments suggested as a more satisfactory
way of dealing with measurements of behavior in this field. Finally,
data presented by Allport and Solomon were subjected to re-analysis
by the method of moments and the results presented.

From what has been presented here we may conclude: 


(1) that the “J-curve hypothesis” is inadequate for the pur- 
poses for which it was designed. 

(2) that the “J-curve hypothesis” creates differences in
conformity which have been shown to be statistically
insignificant (library vs. club-room situations).

(3) that no special technique is required for the analysis 
of conformity data. 

(4) that the method of moments is adequate to give quanti- 
tative expression to conformity and uniformity of be- 
havior, and 

(5) that this method distinguishes between conformity
(skewness) and uniformity (kurtosis).


The purpose of this study is to investigate, by the method of
paired comparisons, a possible scaling of individuals who have made
certain test scores, such that the additive property will be satisfied
and such that a stability in scaling will be maintained; in other
words, a scaling such that the scaled score of an individual will
remain relatively the same regardless of the grouping of individuals
in which he may be placed. The results show that it is possible to
utilize psychophysical methods in psychological and educational test
situations. Among the major findings are that Case V of the Law of
Comparative Judgment is applicable to the data in this problem, that
the method of dividing the intermediate category equally between the
greater and the less was the best of three possible methods, that
internal consistency was satisfied, and, finally, that when a new test
of stability was applied, the distances between the hypothetical
individuals were found to remain the same.


It is well known that the raw scores of any test fail to satisfy the 
additive property and are merely expressed in terms of the relative 
numbers of items passed. With this in mind, the purpose of the study 
here reported is to investigate, by the method of paired comparisons, 
a possible scaling of individuals who have made certain test scores, 
such that the additive property will be satisfied and such that a
stability in scaling will be maintained; in other words, a scaling such that
the distance between the scaled scores of any two individuals will re- 
main the same regardless of the grouping of individuals in which they 
may be placed. In this particular problem, Thurstone’s Law of Com- 
parative Judgment (5) will be subjected to various tests to see wheth- 
er its reliability is consistently maintained. 

Psychophysical methods usually have been applied in the fields of 
sensory discrimination and in scaling attitudes and opinions. The 
application here of the method of paired comparisons lies in a differ- 
ent field, that of the mental and the educational test, wherein the in- 
dividuals are the stimuli and are scaled accordingly. 

For this purpose, a homogeneous vocabulary test, consisting of
one hundred items in multiple-choice form, was selected from a
Biological Science examination developed at The University of Chicago.
From the group of college freshmen who took this examination, one
hundred were selected by alphabetical order of last names. The test
papers were made available by the Board of Examinations. If any
particular person omitted fifty per cent of the items, his test paper
was discarded. This criterion necessitated the elimination of only
twelve tests from among those examined.

* The writer wishes to express appreciation for the invaluable help and
guidance of Professor Harold O. Gulliksen and also to Professor Marion W.
Richardson who suggested the problem and made valuable suggestions.


Table 1 is an illustration showing how the score matrix for the
hundred persons and the hundred items was set up. An unsuccessful
response is designated by an x, an omitted item by an o, a right
answer by a blank space. The hundred persons were originally arranged
in rank order according to total scores. Regarding the table
horizontally, the response of each person to each item of the test may
be observed. Viewing Table 1 vertically, each item may be said to
“judge” the persons taking the test in reference to success or failure.
The hundred persons, arranged in rank order, were then grouped by
fives, forming twenty hypothetical individuals. Thus the hypothetical
individual of top rank, for example, comprises the five highest
ranking persons, and so on until the twentieth hypothetical individual
of lowest rank includes the five persons of lowest rank. The reasons
for grouping the one hundred persons into these twenty hypothetical
individuals were to minimize the labor involved in the method of paired
comparisons and to insure greater stability in the judgments of the
items. The judgments of each of the items on the twenty hypothetical
individuals were obtained. For example, in Table 1, item one judges
hypothetical individual two (where four persons out of five were
successful) greater than individual one (with three persons answering
the item correctly). Similarly, in the case of item two, individuals
one and two are judged equal, for two persons were successful within
each. Comparing individuals one and three in regard to item ten,
individual one (with three persons successful) is judged greater than
individual three (where only two persons were successful in answering
the item). An omitted item, as well as a wrong answer, was counted as
unsuccessful.


TABLE 1

Record of Test Items Answered Successfully or Unsuccessfully by
the Hundred Persons Taking the Test
(An x indicates a wrong answer, an o an omitted item,
a blank space a correct answer)


TABLE 2

The Number of Times the Hypothetical Individual Given at the Top of the
Column was More, Less, or Equally Successful when Compared with Those
at the Left
(The top number for each individual indicates greater, the middle number
less, the lower number equal success)

Hypothetical
Individuals              Hypothetical Individuals
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

  1

    34
  2 15
    51

    46 34
  3 15 22
    39 44

    51 37 29
  4 12 13 23
    37 50 48

    49 39 34 28
  5 13 15 23 24
    38 46 43 48

    54 38 36 33 30
  6 12 14 22 23 24
    34 48 42 44 46

    58 44 42 37 33 32
  7 09 10 22 24 24 27
    33 46 36 39 43 41

    60 55 42 48 37 40 33
  8 09 12 16 25 20 19 27
    31 33 42 27 43 41 40

    65 56 46 48 45 43 41 33
  9 08 08 10 24 22 19 26 25
    27 36 44 28 33 38 33 42

    67 58 54 47 45 43 43 37 35
 10 04 09 14 18 21 20 25 27 32
    29 33 32 35 34 37 32 36 33

    66 59 57 48 41 46 45 39 38 30
 11 07 09 12 19 18 19 23 28 31 30
    27 32 31 33 41 35 32 33 31 40

    69 67 62 59 50 54 46 40 36 36 36
 12 16 02 15 16 16 16 18 20 24 27 29
    15 31 23 25 34 30 36 40 40 37 35

    73 60 56 52 48 45 43 44 36 41 37 34
 13 06 06 13 11 11 16 14 26 30 25 34 35
    21 34 31 37 41 39 43 30 34 34 29 31

    73 68 63 60 54 51 52 44 42 42 37 34 35
 14 03 04 10 14 15 14 20 22 23 28 25 29 30
    24 28 27 26 31 35 28 34 35 30 38 37 35

    71 67 63 60 53 48 48 47 45 47 44 40 38 38
 15 05 08 07 11 11 14 15 24 21 21 26 29 24 34
    24 25 30 29 36 38 37 29 34 32 30 31 38 28

                      58 55 48 54 45 46 51 39 35
 16 03 04 09 13 13 10 11 15 19 21 23 27 26 31 29
    15 24 20 21 22 28 31 30 33 25 32 27 23 30 36

    78 71 69 63 58 58 59 57 54 43 51 46 45 43 37 39
 17 04 02 06 12 09 14 10 18 18 17 18 20 20 24 29 32
    18 27 25 25 33 28 31 25 28 40 31 34 35 33 34 29

    83 84 73 70 69 70 67 60 57 55 58 53 51 50 45 38 36
 18 03 03 04 08 06 07 09 09 10 17 17 19 15 20 28 25 34
    14 13 23 22 25 23 24 31 33 28 25 28 34 30 27 37 30

    83 82 81 77 74 68 68 65 60 59 60 59 60 51 55 48 41 43
 19 04 02 02 11 06 06 07 15 12 15 13 11 14 16 21 28 23 27
    13 16 17 12 20 26 25 20 28 26 27 30 26 33 24 24 36 30

    84 87 85 84 80 80 80 80 75 72 65 69 72 66 55 57 54 57 45
 20 01 01 02 03 04 04 03 05 09 04 10 06 08 12 10 17 17 15 19
    15 12 13 13 16 16 17 15 16 24 25 25 20 22 35 26 29 28 36
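
The judging procedure just described, in which persons are grouped by fives and each item then “judges” one hypothetical individual against another, can be sketched as follows. The score matrix is a hypothetical toy example (ten persons and four items, not the hundred by hundred matrix of Table 1), and the function names are illustrative.

```python
def group_by_fives(score_matrix):
    """Collapse persons (rows, already ranked by total score) into hypothetical
    individuals: for each item, the number of the five persons who succeeded."""
    return [
        [sum(column) for column in zip(*score_matrix[i:i + 5])]
        for i in range(0, len(score_matrix), 5)
    ]


def compare(column_individual, row_individual):
    """Numbers of items judging the first individual greater than, less than,
    or equal to the second."""
    pairs = list(zip(column_individual, row_individual))
    greater = sum(a > b for a, b in pairs)
    less = sum(a < b for a, b in pairs)
    equal = len(pairs) - greater - less
    return greater, less, equal


# HYPOTHETICAL toy data: 10 persons x 4 items; 1 = correct, 0 = wrong or omitted.
toy_scores = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1],
    [1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1],
]
individuals = group_by_fives(toy_scores)
counts = compare(individuals[0], individuals[1])
print(individuals, counts)
```

Applied to every pair of the twenty hypothetical individuals over the hundred items, counts of this kind would fill a table of the form of Table 2, each cell summing to one hundred.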

Then each hypothetical individual was used as a standard and 
compared with every other one in regard to success on each item, and 
the frequencies of greater, less, and equal tabulated in Table 2. For 
example, individual one has been judged greater than individual two 
on thirty-four items out of the hundred, individual one is judged less 
than two on fifteen items, and they are judged equally successful on 
fifty-one items. As in the field of mental test theory, the items “test” 
or judge the group of persons, while in the usual psychophysical prob- 
lem the reverse is true, the judges evaluating the stimuli or items. In 
Table 3, the frequencies translated into the corresponding proportions
are given. The proportion of each as compared with itself was assumed
to be .50, and the number of terms in the psychophysical table is
$n(n - 1)/2$.

In this problem there are three possible ways of dealing with the 
intermediate category: dividing it proportionally to the greater and 
less categories, dividing it equally between them, and dividing it so 
as to split the base line of the equal category. These three methods 
may be described in more detail as follows: 

1. Dividing the equal category proportionally to the greater and
the less categories.

Where $G$ means the number in the greater category, $L$ is the
number in the less category, and $E$ the number in the equal
category;

$$G + \frac{G}{G+L}\,E$$

= value of the greater category when that amount of the equal
category proportional to it has been added;

$$L + \frac{L}{G+L}\,E$$

= value of the less category with that amount of the equal category
proportional to it added;

$$\frac{G + \frac{G}{G+L}\,E}{G + \frac{G}{G+L}\,E + L + \frac{L}{G+L}\,E} = \frac{G}{G+L}. \tag{1}$$

Dividing proportionally, then, gives the same results as disregarding
the equal category entirely and considering the ratio of the greater
divided by the greater plus the less categories.
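
The intermediate step behind this equivalence, written out as a sketch with both parts placed over the common denominator $G + L$, is

$$
\frac{G+\frac{G}{G+L}E}{\left(G+\frac{G}{G+L}E\right)+\left(L+\frac{L}{G+L}E\right)}
=\frac{\dfrac{G\,(G+L+E)}{G+L}}{\dfrac{(G+L)\,(G+L+E)}{G+L}}
=\frac{G}{G+L}.
$$

For the comparison of individuals one and two reported above ($G = 34$, $L = 15$, $E = 51$), this proportion is $34/49$, or approximately .694.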

2. Dividing the equal category equally between the other two
categories.

This proportion takes the form:

$$\frac{G + \tfrac{1}{2}E}{100}. \tag{2}$$


This method gives a smaller value to the greater category than the
preceding method, where the greater category received the greater
part of the equal category. This may be illustrated graphically.

[Figure: normal probability curve with the equal category shaded above
the base line xy; line a and dotted line b mark the two divisions.]

The shaded area represents the equal category divided in such a way
as to give the proportion $E$ to the greater category and the
proportion $E'$ to the less.

$$E : E' = G : L. \tag{3}$$

On the other hand, if the equal category is divided equally between
$G$ and $L$, the division indicated by line a will have to move toward
the mean of the distribution (dotted line b). Accordingly, the
proportions and their corresponding sigma values will be smaller than
in the case of dividing proportionally.



3. Splitting the base line of the equal category.

Referring to the figure illustrated above, the base line xy of the
shaded area is to be equally divided such that xo = oy. While dividing
equally makes the two parts of the intermediate category equal in
terms of proportion or area of the normal probability curve, dividing
by splitting the base line of the equal category makes them equal in
terms of the corresponding sigma values. This latter method will give
a larger sigma value to the greater category than in the case of
dividing equally. In this case, the sigma value corresponding to
$P = \frac{G}{100}$ is added to the sigma value corresponding to
$P = \frac{G+E}{100}$ and the two are averaged.


FIGURE 1
Sigma Values of Column One Plotted Against Those of Column
Two Using the Proportion Obtained by Dividing Equally
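
The three ways of dividing the equal category described above can be compared numerically. The sketch below uses the frequencies reported for individuals one and two ($G = 34$, $L = 15$, $E = 51$); the function names are hypothetical, and the sigma value is taken as the normal deviate returned by the standard-library `statistics.NormalDist().inv_cdf`, which is an assumption about how the tabled sigma values were read.

```python
from statistics import NormalDist

# Sigma value (normal deviate) corresponding to a proportion.
sigma_value = NormalDist().inv_cdf


def sigma_proportional(g, l, e):
    """Method 1: equal category divided proportionally, i.e. G / (G + L)."""
    return sigma_value(g / (g + l))


def sigma_equal(g, l, e):
    """Method 2: equal category divided equally, i.e. (G + E/2) / total."""
    return sigma_value((g + e / 2) / (g + l + e))


def sigma_split_base_line(g, l, e):
    """Method 3: average of the sigma values at the two ends of the base line."""
    n = g + l + e
    return (sigma_value(g / n) + sigma_value((g + e) / n)) / 2


# Individual one versus individual two, as quoted in the text.
g, l, e = 34, 15, 51
methods = (sigma_proportional, sigma_equal, sigma_split_base_line)
values = [method(g, l, e) for method in methods]
print([round(v, 3) for v in values])
```

For this pair the ordering matches the text: dividing equally gives the smallest sigma value, splitting the base line a larger one, and dividing proportionally the largest.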

Two tests were applied to determine which of the three types of
proportion is best adapted to this problem.

1. Test of linearity by inspection (Figures 1-4)

As a rough check, the trend of linearity was determined by
inspection. Using sigma values corresponding to the type of
proportions dividing equally, sixty-two graphs were plotted. Each
column of sigma values, corresponding to the $x$ values of the Law of
Comparative Judgment, was plotted in turn against each of the three
following columns. Column number twenty was plotted against columns
one, two, and three respectively. Figures 1 and 2 are samples chosen
to demonstrate the linearity obtained using the type of proportion
secured by dividing equally, as superior to that of the linearity
resulting from the other two methods to be discussed. Similarly,
twenty-seven graphs were constructed for the sigma values of the
proportion dividing proportionally, using in this case only
immediately adjacent columns and a few other cases in which linearity
was noticeably good or relatively poorer in the case of the previous
graph. (As an example, see Figure 3.) Comparing the two sets of graphs,
a majority of cases resulted in which the proportion obtained by
dividing equally gave plots which were definitely more linear than
dividing proportionally. In a smaller number of cases the two proved
about equal. It was found by inspection also that the slopes of these
plots closely approximated unity. This finding will be used later in
verifying the applicability of Case V of the Law of Comparative
Judgment to these data. Using some cases in which linearity was
relatively superior or inferior in the case of the above-mentioned
proportions, and others representing about equal linearity, eight
graphs were plotted, using sigma values for the proportion obtained by
splitting the base line of the equal category. All plots obtained by
this latter method were less linear than in the cases of the other two
methods used. Here the plot tended to be curvilinear rather than
linear (Figure 4). The method of dealing with the intermediate
category by splitting the base line of the equal category was
accordingly dropped as not worth while for further investigation.


FIGURE 2
Sigma Values of Column Seven Plotted Against Those of Column Nine
Using the Proportion Obtained by Dividing Equally


FIGURE 3
Sigma Values of Column One Plotted Against Those of Column Two Using
the Proportion Obtained by Dividing Proportionally


FIGURE 4
Sigma Values of Column Twenty Plotted Against Those of Column One Using the
Proportion Obtained by Splitting the Base Line of the Equal Category

2. Discrepancy between variance and covariance— 

A more rigorous test applied in finding the best proportion for this
particular problem, using the method of least squares, was to find the
discrepancy between the sum of the variances and twice the covariance
of any two columns of sigma values. Finding this discrepancy should be
a better test than finding the correlation coefficient, where the
formula y = ax + b holds, because in this particular problem the plot
of any two columns has a slope of unity and the value of a in the
equation is one. Then, using x and y as deviations from the mean of
each column of sigma values,

$y = x + b$. (4)

But $b = M_y - M_x = 0$; and using the new origin,

$\sum (y - x)^2 = 0$. (5)

Expanding and transposing,

$\sum x^2 + \sum y^2 = 2\sum xy$. (6)


Results in comparing the proportions dividing proportionally and di- 
viding equally, using sigma values, show the following discrepancies: 
(The columns were chosen which seemed by inspection of the plots 
to be about equal in linearity for each of the two types of proportions. 
Two samples from the middle of the range of individuals, one at the 
end, and one at the beginning, are used.) 
                                        Discrepancy
                                 Dividing      Dividing
                                 Equally       Proportionally

Columns 4 and 3 compared          .0893          .6700
Columns 11 and 12 compared        .1588         1.0239
Columns 10 and 11 compared        .1262          .4490
Columns 16 and 19 compared        .0279          .7873


In this test, which is more rigorous than the method of inspection
described above, dividing equally again showed the smaller discrepancy
and proved the better proportion to use in this particular problem.
Accordingly, the equal category was assigned fifty per cent to the
greater and fifty per cent to the less category, and the final scale
values were determined on the basis of this proportion. The sigma
values resulting from this procedure are presented in Table 4.
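
The discrepancy of formula (6) can be sketched in a few lines. This is a minimal illustration only: the function name and the sample columns are hypothetical, and whether the tabled discrepancies were divided by the number of rows is not stated in the text, so the raw sums are used here.

```python
def discrepancy(col_a, col_b):
    """Return sum(x^2) + sum(y^2) - 2*sum(xy) for two columns of sigma
    values, with x and y taken as deviations from their column means."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same length")
    mean_a = sum(col_a) / len(col_a)
    mean_b = sum(col_b) / len(col_b)
    x = [v - mean_a for v in col_a]
    y = [v - mean_b for v in col_b]
    sum_xx = sum(v * v for v in x)
    sum_yy = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return sum_xx + sum_yy - 2.0 * sum_xy


# Hypothetical columns: the second is the first shifted by a constant,
# so the plot is linear with unit slope and the discrepancy vanishes.
column_one = [-1.2, -0.4, 0.1, 0.6, 1.3]
column_two = [v + 0.3 for v in column_one]
print(discrepancy(column_one, column_two))
```

The quantity equals the sum of squared differences of the deviations, so it is zero only when the two columns differ by a constant.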


Thurstone’s Law of Comparative Judgment (5) may be stated as 





$S_b - S_a = X_{ba}\sqrt{\sigma_b^2 + \sigma_a^2 - 2 r_{ab}\sigma_a\sigma_b}$. (7)


In this study, Case V, the simplest form of the Law of Comparative
Judgment, was tested for applicability. This form, in addition to the
assumptions of normality of discriminal processes, of approximate
equality of the discriminal dispersions, and of zero correlations
between judgments of any two stimuli, when applied to a group, assumes
that the discriminal dispersions are equal. Formula (7) then becomes

$S_b - S_a = X_{ba}\sqrt{2\sigma^2}$, (8)


where $S_b$ is the scale value of stimulus 1 (hypothetical individual 1).

$S_a$ is the scale value of hypothetical individual 2.

$X_{ba}$ is the sigma value corresponding to the observed proportion
of judgments b > a.

$\sigma$ refers to the discriminal dispersions (variability) assumed
equal for all stimuli (individuals) in this problem.


In assuming Case V for the solution of this problem, we must provide
some sort of experimental check for two assumptions: first, normality
of the distribution of discriminal processes (of the variability of the
individuals) and second, equality of the discriminal dispersions.
Professor Thurstone (9) has suggested that a check for these two 
assumptions may be graphically determined by plotting the sigma 
values of proportions b > a of any two adjacent columns. If the plot 
is linear, the assumed normal distribution of discriminal processes is 
correct; if there is a slope of unity, the discriminal dispersions are 
equal. This check has already been utilized when these sigma values 
were plotted against each other in selecting one of the three ways of 
dealing with the intermediate category. This means that in this prob- 
lem, since there was linearity and unity of slope, the variability of 
the hypothetical individuals is equal throughout and the distributions 
are normal, and therefore Case V may be used. 

Having satisfied the assumptions of Case V, the solution of the 
problem is continued with this case. 

Using the subscript k to signify each stimulus in turn compared 
with the standard, 


$S_b - S_k = X_{bk}\sqrt{2}\,\sigma$; (9)
$S_a - S_k = X_{ak}\sqrt{2}\,\sigma$. (10)


Subtracting and summing for all values of k, dividing by n, and letting
the sigma value be the unit of the scale,


$S_b - S_a = \dfrac{\sum_k (X_{bk} - X_{ak})}{n}\sqrt{2}$. (11)


In applying formula (11) to the problem, the quantity $(X_{bk} - X_{ak})$
for each value of k is entered in a table of differences for each two
adjacent individuals as stimuli; that is, $d_{12}$, $d_{23}$, as given in
Table 5. The mean of each column of differences is calculated as a better
measure than any one of them, and each mean difference is multiplied by
$\sqrt{2}$ as Case V requires. This value, $M_{\mathrm{diff}}\sqrt{2}$,
gives the final scale separations in terms of the standard deviation of
the discriminal dispersions (variability of the individuals). Then the
scale is built up, a value of zero being assigned to the lowest scale
value and these values being accumulated (Table 6).
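
The procedure just described may be sketched as follows. The sketch assumes a rectangular table `sigma[k][j]` holding the sigma value of stimulus j judged against standard k, with columns already ordered from lowest to highest; the function name and the small example are hypothetical and are not the data of this study.

```python
import math


def case_v_scale(sigma):
    """Build Case V scale values from a table of sigma values.

    sigma[k][j] is the sigma value for stimulus j compared with standard k.
    Adjacent-column differences are averaged over k, multiplied by sqrt(2),
    and accumulated from a zero origin at the lowest stimulus.
    """
    n_rows = len(sigma)
    n_cols = len(sigma[0])
    scale = [0.0]
    for j in range(n_cols - 1):
        diffs = [row[j + 1] - row[j] for row in sigma]
        mean_diff = sum(diffs) / n_rows
        scale.append(scale[-1] + mean_diff * math.sqrt(2))
    return scale


# Hypothetical check: generate error-free sigma values from known scale
# values (unit: the discriminal dispersion) and recover the separations.
true_scale = [0.0, 0.5, 1.2]
sigma_table = [[(s_j - s_k) / math.sqrt(2) for s_j in true_scale]
               for s_k in true_scale]
print(case_v_scale(sigma_table))
```

With real data the differences in a column vary from row to row, which is why the mean of the column is taken before accumulation.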


TABLE 6 
Unweighted and Weighted Scale Values 


Unweighted Scale    Weighted Scale    Unweighted Scale    Weighted Scale
     Values             Values             Values             Values
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The question arises as to whether the use of weighted values 
might differ significantly from the unweighted. An adaptation of the 
Müller-Urban weights (2) was used, similar in principle to Thurstone’s
weighting formula (10), which weights inversely to the square
of the standard error, or in proportion to the reliability of the original
value.


The weight of a difference $d_{ab}$ is given by the formula

$W_{ab} = \dfrac{1}{\dfrac{1}{W_{ak}} + \dfrac{1}{W_{bk}}}$, (12)

in which $W_{ak}$ and $W_{bk}$ are the Müller-Urban weights corresponding
to the proportions $P_{ak}$ and $P_{bk}$, respectively, k meaning any
other stimulus. Multiplying each difference down the column by its
appropriate weight, the weighted differences are found, and their mean
difference calculated and multiplied by $\sqrt{2}$.

Now Case V becomes

$S_{wb} - S_{wa} = X_{wba}\sqrt{2}$. (13)


Then the scale is built up as in the case of the unweighted values 
(Table 6). 
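
A minimal sketch of the weighting step follows. The Müller-Urban weights themselves are taken as given inputs, and dividing the weighted sum by the sum of the weights is an assumption of this sketch, since the text says only that the mean of the weighted differences was taken. All names and numbers are hypothetical.

```python
import math


def difference_weight(w_ak, w_bk):
    """Weight of the difference X_bk - X_ak, combining the weights of the
    two proportions as the reciprocal of the sum of their reciprocals."""
    return 1.0 / (1.0 / w_ak + 1.0 / w_bk)


def weighted_separation(diffs, weights):
    """Weighted mean of one column of differences, multiplied by sqrt(2).
    Normalizing by the sum of the weights is an assumption made here."""
    total_weight = sum(weights)
    weighted_mean = sum(w * d for w, d in zip(weights, diffs)) / total_weight
    return weighted_mean * math.sqrt(2)


# Hypothetical column of three differences with illustrative weights.
weights = [difference_weight(0.9, 0.8),
           difference_weight(0.6, 0.7),
           difference_weight(0.3, 0.4)]
print(weighted_separation([0.30, 0.35, 0.50], weights))
```

Because the combined weight is the reciprocal of a sum of reciprocals, a difference involving one poorly determined proportion receives a small weight however reliable the other proportion may be.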

In testing for the significance of the difference between the 
means of weighted and unweighted scale values, several procedures 
have been considered. A test of significance directly comparing the 
two sets of scale values (weighted against unweighted) was not pos- 
sible because the arbitrary origin of each scale affects the numerator 
of the resulting critical ratio and because of the accumulation of 
differences as the number of scale values which one happens to have 
increases. In view of this, the method used in this study is that of 
finding the significance of the difference of the mean differences of 
the respective columns before they are accumulated into scale values. 
Fisher’s method of handling the differences directly was used. This 
method is applicable when the variables are correlated, and corrects 
for a small sample. The critical ratio is 2.2, being significant only at 
the five per cent level. Thus the results are not significant under the 
crucial tests at the one and two per cent levels. On this basis, the 
weighted scale was discarded as not worth the trouble of weighting, 
and the unweighted scale values adopted. Since there were no
proportions $P_{ba}$ above .97 or below .03 in using the proportion
dividing equally, all proportions in the table were retained and given
equal weight.
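
The test on the mean differences can be sketched as below, on the assumption that the method referred to is the t test applied directly to paired differences; the text does not give the computation. The function name and the figures are hypothetical.

```python
import math


def paired_t(first, second):
    """Return (t, degrees of freedom) for the mean of the paired
    differences first[i] - second[i]."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased variance of the differences (divisor n - 1), which is the
    # small-sample correction mentioned in the text.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean / standard_error, n - 1


# Hypothetical weighted and unweighted mean differences for four columns.
weighted = [1.0, 2.0, 3.0, 4.0]
unweighted = [0.0, 2.0, 1.0, 3.0]
t_value, dof = paired_t(weighted, unweighted)
print(t_value, dof)
```

Working with the differences directly avoids any need to estimate the correlation between the two sets of values.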

For the proof of internal consistency of the scale, there are three 
checks presented in this problem, one based on an inspection of the 
columns of differences, one based on the mean of the discrepancies 
between the actual and the theoretical proportions, and a test for 
stability. 

Professor Thurstone has suggested a quick check for proving in- 
ternal consistency by inspecting each column of differences derived 
from sigma values (Table 5). These x distances should hang together; 
that is, there should be no systematic drifts in values up or down the 
column. On inspecting Table 5, we see that no column gives any 
systematic increase or decrease in values. However, this is only a 
rough check. 

The real proof for internal consistency which Professor Thurstone has
provided (8) is that based on the algebraic mean of the discrepancies
between actual and theoretical proportions and on the average
discrepancy, disregarding sign. Starting with the final scale values
and working backward to the theoretical proportions demanded of them,
the differences between these values and those of the original
proportions were obtained and the mean of these discrepancies
calculated to be .00018. The distribution of the discrepancies has been
plotted and the results shown in Figure 5. The distribution proves to
be symmetrical. The average discrepancy, disregarding signs, is .014.
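
Working backward from scale values to theoretical proportions can be sketched as follows, assuming Case V with the discriminal dispersion as the unit, so that the sigma value implied by two scale values is their difference divided by the square root of two. The names and the two-stimulus example are hypothetical.

```python
import math


def normal_cdf(z):
    """Area of the unit normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


def theoretical_proportion(s_b, s_a):
    """Proportion of judgments b > a implied by Case V scale values."""
    return normal_cdf((s_b - s_a) / math.sqrt(2))


def discrepancy_summary(scale, observed):
    """observed maps (b, a) to the observed proportion of judgments b > a.
    Returns the algebraic mean discrepancy and the mean absolute one."""
    diffs = [p - theoretical_proportion(scale[b], scale[a])
             for (b, a), p in observed.items()]
    algebraic_mean = sum(diffs) / len(diffs)
    mean_absolute = sum(abs(d) for d in diffs) / len(diffs)
    return algebraic_mean, mean_absolute


# Hypothetical scale of two individuals and two observed proportions.
scale_values = [0.0, 1.0]
observed = {(1, 0): 0.78, (0, 1): 0.22}
print(discrepancy_summary(scale_values, observed))
```

An algebraic mean near zero with a small mean absolute discrepancy is the pattern reported in the text for the scale of this study.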

Another check for the reliability of the scale which can be applied
in this particular problem is the test of stability. Stability in this
situation means that the scale distance between any two individuals
taking a test remains unchanged no matter in what new group of
individuals they are placed. In this test, four hypothetical individuals
were chosen at random from among the twenty hypothetical individuals
originally scaled, to be placed along with six new hypothetical
individuals forming a new scale. For these six new individuals, the
test records of thirty new persons who took this same test were
selected at the Board of Examinations in the same manner as those
originally chosen, that is, by alphabetical order of last names. These
were again ranked in terms of final scores and grouped by fives, forming
six hypothetical individuals. Then the four hypothetical individuals of
the old scale were slipped in among them, all being ranked according to
the rank of the raw scores, as is shown in Table 7. Then the new scale
was built up, using the four individuals from the original scale and
the six from the new data.


TABLE 7
Test Items Answered by Thirty Additional Cases and Twenty Original Persons
in the New Data for the Test of Stability
(The hypothetical individuals 1, 3, 8, 9 were taken at random from the original
scale. Their ranks in the original scale were 1, 6, 17, and 18 respectively)

Since the origins of the old scale and of the new scale are arbitrary,
one can not expect the scaled scores of these four individuals to be
the same in the new scale as they were in the old one. However, if the
scale is stable, the distance between them will be the same in both
scales. The results indicate, as is shown in Table 8 and Figure 6, that
the scale distance between any two of the individuals common to both
scales is very nearly the same on the new scale as on the old, and thus
that the scale has a high degree of stability.


TABLE 8
Record of Distances Between Any Two Hypothetical Individuals Common to
Both Old and New Scales in the Test for Stability

Hypothetical Individuals Common to Both Scales

Rank in Old Scale       1      6     17     18
Rank in New Scale       1      3      8      9

The distance between any two of the four hypothetical individuals
should remain the same if the test for stability holds. (The mean
differences of both scales are used in the comparison.)

M 1-3 (new scale) should equal M 1-6 (old scale) . . . .   .4910 = .4880
M 3-8 (new scale) should equal M 6-17 (old scale)  . . .   .6689 = .6915
M 8-9 (new scale) should equal M 17-18 (old scale) . . .   .1896 = .1893


The scaled scores for the hypothetical individuals, in both the 
old and new sets of data, were plotted against their averaged raw 
scores, respectively, and the plots found to be fairly linear. The 
question may be raised as to what the use of building up a scale may
be, if plots of the raw and the scaled scores tend toward linearity.
However, there are basic differences between the scaled scores and
the raw scores. The scaled scores have fulfilled the mathematical
additive function and are therefore based on a rational function; the
raw scores are not. It may be pointed out that the linearity for the
data of this study may be attributable to chance.


FIGURE 5
Distribution of the Discrepancies Between Actual and Theoretical Proportions

CONCLUSIONS 


1. This experiment, using the paired comparison method, has 
proved that it is possible to scale individuals taking any mental and 
educational test. Thus, it is possible to utilize the psychophysical 
methods in the mental test situation. 

2. In this particular problem it was found that, of three possible 
ways of dealing with the intermediate category, the method of using 
the proportion obtained by dividing this category equally between the 
other two was most satisfactory. It may be said that while the
two-category judgments are preferred to three-category judgments in the
ordinary psychophysical problem, it is impossible to use only two
categories here. We cannot avoid the fact that in the cases where
two hypothetical individuals have the same number of successful
persons they must be judged equal.

3. In applying Case V of the Law of Comparative Judgment to
the data of this study, the two assumptions of normality of the distri- 
bution of the discriminal processes (variability of the individuals) 
and of equality of the discriminal dispersions have been actually sub- 
jected to test, and the assumptions have been found to be completely 
valid for these data. 

4. The weighted scale and the unweighted scale were compared, and the
difference between the mean differences was found not to be significant
at the one or two per cent level.

5. For the proof of internal consistency three tests were applied,
one based on observational check on the columns of differences, one
based on the algebraic average discrepancy and the average discrepancy,
disregarding sign, and a test of stability. The results clearly
indicate that the scale is internally consistent.

6. The test of stability here introduced may be said to confirm
Professor Thurstone’s method of internal consistency. So far as the
writer knows, this type of test has not heretofore been attempted.


FIGURE 6
Distances Between Any Two Hypothetical Individuals in the Old Scale Plotted
Against the Distances Between the Same Individuals in the New Scale


A METHOD OF ESTIMATING ACCURACY OF TEST SCORING


WALTER L. DEEMER 
HARVARD UNIVERSITY, CAMBRIDGE, MASSACHUSETTS 


When errors of test scoring obey a Poisson frequency law (theo- 
retical considerations suggest that they do), the method described 
may be used for finding the upper fiducial limits of scoring errors 
per paper. A criterion is suggested for establishing tolerance limits 
on scoring errors, and a method is given (1) for finding the prob- 
ability of being wrong in the statement that the tolerance limit is 
being met for a given size sample or (2) for finding the size of sam- 
ple that will make this probability not greater than some fixed value. 


Most methods of scoring tests are liable to scoring error. Tests 
scored by test scoring machines are probably as little liable to error 
in scoring as any, but even such test scores may contain errors due to 
faint markings, or to stray dots on the answer sheet, or to misread- 
ing of the dials on the machine. The number of scoring errors in man- 
ually scored tests may be relatively large, and in a project of any size, 
where many tests are to be scored, it is often desirable that samples 
of scored tests be rescored in order to estimate the number of scoring 
errors that are being made. An estimate of scoring errors is neces- 
sary in evaluating later findings, and such an estimate may make it 
possible to make adjustments in the scoring methods if the sample 
indicates that too many errors are being made. This paper deals with 
the problem of estimating the number of scoring errors in a set of 
papers from the number of scoring errors found in a sample. 

The method is based on the assumption that the number of scor- 
ing errors per paper follows the Poisson frequency law. From a 
priori considerations this seems a reasonable assumption for most 
tests, since the number of items is generally large and the probability 
of making a scoring error on any item is small and presumably con- 
stant from item to item. For any given test situation the observed 
frequency of scoring errors should be tested against the Poisson dis- 
tribution, using a chi-squared test. 
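
A chi-squared comparison of this kind may be sketched as follows. The frequencies are hypothetical, the last cell is treated as "that many errors or more," and the mean is computed as though every paper in the last cell had exactly that many errors; these are simplifying assumptions of the sketch, not part of the method described here.

```python
import math


def poisson_pmf(k, m):
    """Poisson probability of exactly k events when the mean is m."""
    return math.exp(-m) * m ** k / math.factorial(k)


def poisson_chi_squared(counts):
    """counts[k] is the number of papers with k scoring errors; the last
    cell stands for 'k or more'. Returns (chi-squared, degrees of freedom,
    estimated mean)."""
    n_papers = sum(counts)
    mean = sum(k * c for k, c in enumerate(counts)) / n_papers
    last = len(counts) - 1
    expected = [n_papers * poisson_pmf(k, mean) for k in range(last)]
    expected.append(n_papers - sum(expected))  # remaining tail frequency
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    # One degree of freedom is lost for the total, one for the fitted mean.
    degrees_of_freedom = len(counts) - 2
    return chi_squared, degrees_of_freedom, mean


# Hypothetical frequencies: papers with 0, 1, 2, and 3 or more errors.
print(poisson_chi_squared([60, 30, 8, 2]))
```

The resulting statistic is referred to a chi-squared table with the stated degrees of freedom; cells with very small expected frequencies would ordinarily be pooled first.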

The number of opportunities for making errors in scoring may
be different from the number of items in the test. If the mean number
of scoring errors per paper is $\bar{x}$ and there are n items in the
test, the assumption that the probability of making a scoring error is

$q = \bar{x}/n$ (1)

may not be valid. The best method of estimating q (the probability of
making an error) and n (the number of opportunities for making
scoring errors) is to compute the mean, $\bar{x}$, and the variance, V,
for the sample, and use the relations

$q = 1 - V/\bar{x}$ (2)
$n = \bar{x}^2/(\bar{x} - V)$, (3)

which are found by solving the formulas for the binomial,

$\bar{x} = nq$ (4)
$V = nq(1 - q)$, (5)

for n and q.
It is clear that the binomial estimated from (2) and (3) will, for
$\bar{x} < V$, be of the form

$[-q + (1 + q)]^{-n}$, (6)

which is called by Whitaker (9) and Pearson (4) a “negative binomial.”

Whether the binomial estimated from (2) and (3) is close 
enough to the Poisson distribution to warrant the assumption that the 
population is distributed according to the Poisson law is discussed by 
Whitaker (9) and “Student” (7). Pearson (4) gives a method of 
resolving a series giving a negative binomial into a sum of two Poisson 
series. The rest of this paper is based on the assumption that the fit 
of the data in hand to a Poisson distribution has been found satisfac- 
tory. 
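
The estimates of q and n from relations (2) and (3) can be sketched as follows. The data are hypothetical, and the variance is computed with divisor equal to the number of papers, which is an assumption since the text does not specify the divisor.

```python
def binomial_moment_estimates(errors_per_paper):
    """Return (mean, variance, q, n) from the counts of scoring errors
    found on each rescored paper, using q = 1 - V/mean and
    n = mean^2 / (mean - V)."""
    n_papers = len(errors_per_paper)
    mean = sum(errors_per_paper) / n_papers
    variance = sum((e - mean) ** 2 for e in errors_per_paper) / n_papers
    if mean == 0 or mean == variance:
        raise ValueError("estimates are undefined when mean is 0 or equals V")
    q = 1.0 - variance / mean
    n = mean ** 2 / (mean - variance)
    return mean, variance, q, n


# Hypothetical counts of scoring errors on ten rescored papers.
sample = [0, 1, 0, 2, 1, 0, 0, 1, 1, 0]
mean, variance, q, n = binomial_moment_estimates(sample)
print(mean, variance, q, n)
```

When the variance exceeds the mean, both q and n come out negative, which corresponds to the negative binomial case mentioned above; a small q with a large n is the condition under which the Poisson law is a close approximation.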

Consider an example of the problems that arise in practice when
we are trying to estimate the number of scoring errors that have been
made in a set of papers. Let $N_t$ be the total number of papers in the
set, and $N_s$ be the number of papers in the sample rescored. If x
scoring errors are found in the $N_s$ papers, what estimate, $m_u$, may
be made of the upper limit of the mean number of scoring errors in the
$N_t$ papers, if a certain probability of being wrong in the estimate is
acceptable?

The usual method of fiducial inference leads to two estimates of 
a parameter, an upper limit and a lower limit, such that in the long 
run of trials the statement that the true value of the parameter lies 
between these two limits will be wrong less than 100 p% of the time, 
where p, between 0 and 1, is called in this paper the level of signifi- 
cance (1 — p is sometimes called the fiducial coefficient or the confi- 
dence coefficient) . 
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Since we are interested here only in the upper fiducial limit, 
the problem is somewhat modified. The parameter we are interested
in is the mean number of scoring errors per paper in the population.
We shall denote this parameter by m. We wish to find a value of m,
say $m_u$, such that any hypothesis that $m > m_u$ may be rejected at the
chosen level of significance. Stated another way, we want to find a
value of m, say $m_u$, such that the statement that $m < m_u$ will be
right, in the long run, 100(1 − p)% of the time. The value of p is at
the choice of the experimenter, but as p is made small $m_u$ becomes large.

The problem of estimating the upper limit of the number of scor- 
ing errors in a set of papers is thus seen to be one of finding fiducial 
limits for the parameter, m, of the Poisson distribution. A number 
of papers have dealt with this problem. Ricker (6) and Garwood (2) 
consider the problem of finding upper and lower fiducial limits for the 
parameter of the Poisson distribution. Przyborowski and Wilenski 
(5) have considered the problems which arise in finding upper fidu- 
cial limits for the Poisson distribution. 

The following theorem therefore proves nothing new, but it makes the
problem of setting up fiducial limits for the Poisson somewhat easier
to follow.
Theorem 1. If the number of scoring errors per paper is distributed
according to a Poisson law with parameter m, then the sum of the
number of scoring errors in a sample of size $N_s$ will obey a Poisson
law with parameter $mN_s$ (see Uspensky (8), p. 279). This may be
proved as follows. The Poisson distribution is

$f(x) = \dfrac{e^{-m} m^x}{x!}$. (7)

The characteristic function of the Poisson is defined as

$\phi(t) = \sum_{x=0}^{\infty} e^{itx}\,\dfrac{e^{-m} m^x}{x!}$ (8)

$= e^{m(\cos t + i \sin t - 1)}$.* (9)

Since the characteristic function of a sum is the product of the
characteristic functions, we get for the characteristic function of the
sum of a sample of size $N_s$:

$e^{N_s m(\cos t + i \sin t - 1)}$, (10)

and, since “the distribution function of probability is uniquely
determined by the characteristic function” (8, p. 271), the theorem is
proved.


* The writer is indebted to Dr. L. Alaoglu for suggesting that to evaluate the
sum in (8) let $z = me^{it}$.
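
Theorem 1 can also be checked numerically by convolving the distribution for a single paper with itself. The sketch below is an illustration rather than a proof; the parameter values are hypothetical, and independence of the papers is assumed, as it is in the theorem.

```python
import math


def poisson_pmf(k, m):
    """Poisson probability of exactly k events when the mean is m."""
    return math.exp(-m) * m ** k / math.factorial(k)


def convolve(p, q):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def distribution_of_sum(m, n_papers, max_k=40):
    """Probabilities of 0..max_k total errors on n_papers papers, each
    paper following a Poisson law with mean m."""
    single = [poisson_pmf(k, m) for k in range(max_k + 1)]
    total = [1.0]  # a sum over zero papers is certainly zero
    for _ in range(n_papers):
        # Truncating at max_k leaves the retained probabilities exact.
        total = convolve(total, single)[:max_k + 1]
    return total


# Hypothetical values: m = 0.4 errors per paper, 3 papers in the sample.
summed = distribution_of_sum(0.4, 3)
for k in range(5):
    print(k, summed[k], poisson_pmf(k, 3 * 0.4))
```

The two printed columns agree to rounding error, as the theorem requires.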







This theorem gives us the information we need to find the upper
fiducial limit of the parameter of the Poisson distribution. Let
$\bar{x}$ be the mean number of scoring errors per paper found in the
sample. The probability of getting a sample of $N_s$ papers giving
$N_s\bar{x}$ or fewer scoring errors from a population with parameter
$m_1$ is, by Theorem 1,

$p_1 = \sum_{x=0}^{N_s\bar{x}} \dfrac{e^{-N_s m_1}(N_s m_1)^x}{x!}$. (11)

If $p_1 < p$, where p is the level of significance, the hypothesis that
$m > m_1$ is rejected. If $p_1 = p$, we call $m_1$ the upper fiducial
limit and denote it by $m_u$.
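
Where no table is at hand, the value at which the probability in (11) equals p can be located by bisection, since the probability falls steadily as the parameter rises. The following is a minimal sketch; the function names are hypothetical, and bisection is simply one convenient way of solving the equation.

```python
import math


def poisson_cdf(b, mean):
    """Probability of b or fewer events under a Poisson law."""
    term = math.exp(-mean)
    total = term
    for x in range(1, b + 1):
        term *= mean / x
        total += term
    return total


def upper_limit_total(b, p, tol=1e-10):
    """Value m' of the Poisson mean for which P(b or fewer errors) = p."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(b, hi) > p:  # widen until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if poisson_cdf(b, mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def upper_limit_per_paper(b, n_sample, p):
    """Upper fiducial limit of errors per paper for b errors found in
    n_sample rescored papers."""
    return upper_limit_total(b, p) / n_sample


# One error found in three rescored papers, p = 0.01.
print(upper_limit_total(1, 0.01), upper_limit_per_paper(1, 3, 0.01))
```

For no errors found, the limit reduces to the negative natural logarithm of p, which gives a quick check on the routine.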


TABLE 1*

Upper fiducial limits, m’, of scoring errors
per sample of size $N_s$

                               p
  b     0.50   0.10   0.05   0.02   0.01   0.005  0.001

  0      0.7    2.3    3.0    3.9    4.6    5.3    6.9
  1      1.7    3.9    4.7    5.8    6.6    7.4    9.2
  2      2.7    5.3    6.3    7.5    8.4    9.3   11.2
  3      3.7    6.7    7.8    9.1   10.0   11.0   13.1
  4      4.7    8.0    9.2   10.6   11.6   12.6   14.8
  5      5.7    9.3   10.5   12.0   13.1   14.2   16.5
  6      6.7   10.5   11.8   13.4   14.6   15.7   18.1
  7      7.7   11.8   13.1   14.8   16.0   17.1   19.6
  8      8.7   13.0   14.4   16.2   17.4   18.6   21.2
  9      9.7   14.2   15.7   17.5   18.8   20.0   22.7
 10     10.7   15.4   17.0   18.8   20.2   21.4   24.1
 11     11.7   16.6   18.2   20.1   21.5   22.8   25.6
 12     12.7   17.8   19.4   21.4   22.8   24.1   27.0
 13     13.7   19.0   20.7   22.7   24.1   25.5   28.4
 14     14.7   20.1   21.9   24.0   25.5   26.8   29.9
 15     15.7   21.3   23.1   25.2   26.7   28.2   31.2
 20     20.7   27.0   29.1   31.5   33.1   34.7   38.0
 25     25.7   32.7   34.9   37.5   39.3   41.0   44.6
 30     30.7   38.3   40.7   43.5   45.4   47.2   51.1
 35     35.7   43.9   46.4   49.4   51.4   53.3   57.4
 40     40.7   49.4   52.1   55.2   57.4   59.4   63.7
 45     45.7   54.9   57.7   61.0   63.2   65.4   69.8
 50     50.7   60.4   63.3   66.7   69.1   71.3   76.0

b = number of scoring errors found in a sample of size $N_s$.

p = level of significance; if the statement is made that the mean number of
errors per paper, m, in the total set is less than m’/$N_s$, the statement
will be true in the long run of trials, 100(1 − p)% of the time.


* Except for the column for p = 0.50, this table is condensed from a table
given by J. Przyborowski and H. Wilenski (5, p. 288).
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The expression (11) may be evaluated by use of the following 
relationship. Let $N_s m_1 = m'$ and $N_s\bar{x} = b$; then

$\sum_{x=0}^{b} \dfrac{e^{-m'}(m')^x}{x!} = \dfrac{1}{b!}\int_{m'}^{\infty} u^b e^{-u}\,du$, (12)

as may be shown by integrating the right-hand member by parts. The
value of the right-hand member may be found in Tables of the Incomplete
Gamma Function, edited by Karl Pearson. Values of m’ (= $N_s m_u$)
for 7 values of p and for values of b (= $N_s\bar{x}$) from 0 to 50 are
given in Table 1, which has been condensed somewhat from (5, Table V).
The column for p = 0.50 is not given in (5).

Example 1. Assume that in a sample of $N_s = 3$, one scoring error
is found. Then b = 1. Say p has been chosen as 0.01; then Table 1 is
entered with p = 0.01 and b = 1, and the tabled value is m’ = 6.6.
This means that the hypothesis that $N_s m > 6.6$ is rejected at the
chosen level of significance, but that the hypothesis that
$N_s m < 6.6$ may not be rejected. The upper fiducial limit of scoring
errors per paper is therefore $m_u = 6.6/N_s = 2.2$. If we say that the
mean of the Poisson distribution of which we have a random sample is
less than $m_u$, we shall be wrong, in the long run, in not more than
1% of trials.

As $N_s$ is increased, the upper confidence limit is decreased for a
given $\bar{x}$. Thus, if $\bar{x}$ remains 1/3 but $N_s$ is increased
to 12, we enter Table 1 with p = 0.01 as before, but b is now 4. We
find m’ = 11.6, giving $m_u$ = 11.6/12 = 0.967. This means that the
hypothesis that m > 0.967 may be rejected at the 1% level, or, in other
words, that the statement that m < 0.967 will be false, in the long
run, not more than 1% of the time.

The factors to consider in deciding on size of sample will be the
relative importance of having $m_u$ accurately determined, compared to
the cost of rescoring. The minimum value of $N_s$ will be one that will
give for b = 0 a value of $m_u < T$, where T is the maximum number
of errors per paper considered acceptable. T will be called the
tolerance limit of scoring errors per paper. This minimum value of
$N_s$ will be denoted by $N'_s$. It is found as follows. Enter Table 1
with b = 0 and p equal to whatever value has been chosen for the level
of significance; the tabled value, m’, is equal to $N'_s m_u$. Since we
want $m_u < T$ we have


$N'_s > m'/T$. (13)


Example 2. If T is 2, and p = 0.01, we find m’ = 4.6 and $N'_s$ >
4.6/2 = 2.3. The next larger integer is taken when $N'_s$ is not
integral, and 3 papers would therefore be drawn at random from the
set of $N_t$ and rescored. If no errors in scoring were found in these
three papers we would be confident at the 1 − p level that the
tolerance limit T was not being exceeded.
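
The minimum sample size of Example 2 may be computed directly. For b = 0 the probability of finding no errors is $e^{-m'}$, so the tabled value is the negative natural logarithm of p; the sketch below uses that closed form in place of Table 1, and the function name is hypothetical.

```python
import math


def minimum_sample_size(tolerance, p):
    """Smallest number of papers which, if rescored without finding any
    error, gives an upper fiducial limit per paper within the tolerance.
    Returns (number of papers, m')."""
    m_prime = -math.log(p)  # solves exp(-m') = p, the b = 0 case
    return math.ceil(m_prime / tolerance), m_prime


papers, m_prime = minimum_sample_size(tolerance=2.0, p=0.01)
print(papers, round(m_prime, 1))
```

Rounding up follows the rule of the example that the next larger integer is taken when the quotient is not integral.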

Example 3. If one error is found during the scoring of the three
papers taken in Example 2, m’ changes. The new value of m’ is found
by entering Table 1 with the same p, but with b now equal to 1, since
b is the number of scoring errors found in the $N_s$ papers rescored;
m’ is found to be 6.6. Hence $m_u$ is 6.6/3 = 2.2 > T. There are now
two courses open: (1) we may assume that the tolerance limit T is
not being met and therefore change the method of scoring to make it
more precise, or (2) we may take a larger sample in order to get a
more precise estimate of $m_u$ before making any inferences about T.
If the scores of the group and not the individuals are the important
thing, and if we may assume that no scoring errors remain in a paper
after it has been rescored, we may rescore enough papers so that for
the total group of $N_t$ papers the upper estimate of the number of
errors per paper is not greater than T. If we denote this upper limit
for the whole set (after $N_s$ have been rescored) by $m_w$, we have


$m_w = m_u\,\dfrac{N_t - N_s}{N_t}$. (14)


If we set $m_w \le T$, substitute $m_u = m'/N_s$, and solve for $N_s$
we get


$N_s \ge \dfrac{m' N_t}{N_t T + m'}$. (15)

Hence we see that, if we are interested only in having $m_w < T$
without respect to the size of $m_u$, we may always rescore enough
papers to satisfy this requirement. But even in this case, $m_u$ should
be examined to see if it is so large as to indicate a possibility that
more accurate scoring is possible.
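
Relations (14) and (15) can be sketched as follows. The value of m’ is taken as given from Table 1 and is held fixed while the number of papers is solved for, which is how the paper's own numerical illustrations of (15) proceed; the function names are hypothetical.

```python
def whole_set_limit(m_prime, n_sample, n_total):
    """Upper limit of errors per paper for the whole set after n_sample
    papers have been rescored and corrected."""
    m_u = m_prime / n_sample
    return m_u * (n_total - n_sample) / n_total


def papers_to_rescore(m_prime, n_total, tolerance):
    """Number of papers at which the whole-set limit equals the tolerance."""
    return m_prime * n_total / (n_total * tolerance + m_prime)


# Figures used in the paper's illustrations of (15):
# m' = 4.7 (b = 1, p = 0.05), T = 0.48, sets of 10 and of 100 papers.
for n_total in (10, 100):
    needed = papers_to_rescore(4.7, n_total, 0.48)
    print(n_total, round(needed, 1), whole_set_limit(4.7, needed, n_total))
```

In practice the result is rounded up to a whole number of papers, and if further errors are found in the enlarged sample, b and hence m’ must be re-read from Table 1.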

Criteria for Choosing T. It seems clear that the value of T should
depend to some extent on the variability of the scores in the group. If
the variance of the scores is large, a slight error in a score due to
scoring errors will not be so important as the same error when the
variance of the scores is small. T may also be a function of the
reliability of the test, as the more reliable the test the more
important is a given change of score due to scoring errors. Professor
T. L. Kelley has suggested to the writer that the most valid criterion
of T will generally be some function of the standard error of a test
score. If s is the standard deviation of the scores in the set and r is
the reliability coefficient of the test, we have for the standard error
of the test score


$s_e = s\sqrt{1 - r}$. (16)


If the reliability coefficient is not known when the scoring is being
checked, it may be estimated.

In line with a proposal about significant figures (3), Professor
Kelley suggests that the most serviceable criterion would be one in
which the median number of scoring errors was $\tfrac{1}{3} s_e$.

This amounts to choosing


$T = \tfrac{1}{3} s_e$ (17)


and using p = 0.50. If we choose a higher significance level, T would
be increased. If the distribution of scoring errors were normal, the
approximate equivalent of p = 0.50, $T = \tfrac{1}{3} s_e$, would be p = 0.05,
T=1.0s.,orp—0.01,T=1358.. 

Example 4. The approximate $s_e$ for the Paragraph Meaning Test
of the Stanford Achievement Test Advanced Battery is 3.7 items;
$T = \tfrac{1}{3} s_e = 1.2$. If one paper were chosen at random from a
set and no scoring errors were found, we would be satisfied that the
tolerance limit was being met, as Table 1 gives m’ = 0.7 for b = 0,
p = 0.50; hence $m_u$ = 0.7 < 1.2.
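
The tolerance limit of (16) and (17) is a one-line computation, sketched below. The standard deviation and reliability in the first call are hypothetical; the figure 3.7 is the standard error quoted in Example 4.

```python
import math


def standard_error(s, r):
    """Standard error of a test score from the standard deviation of the
    scores, s, and the reliability coefficient, r."""
    return s * math.sqrt(1.0 - r)


def tolerance_limit(s_e, fraction=1.0 / 3.0):
    """Tolerance limit T taken as a fraction of the standard error."""
    return fraction * s_e


# Hypothetical test: standard deviation 10 items, reliability 0.91.
print(standard_error(10.0, 0.91))

# Example 4: standard error of 3.7 items.
print(round(tolerance_limit(3.7), 1))
```

The fraction is left as a parameter so that a stricter or looser criterion can be tried alongside a different level of significance.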

Intuitively we may feel that 0 errors in a sample of 1 is not
sufficient evidence that m, the mean number of errors in the total set
of papers, is not greater than 1.2; it should be kept in mind that
since we have used p = 0.50, our evidence is only that in half the
trials m < 1.2. Reference to the column headed p = 0.10 in row b = 0
shows that in 10% of trials m > 2.3, and in 1% m > 4.6. Some workers
may no doubt feel that a more rigorous criterion, say
$T = \tfrac{1}{3} s_e$ with p = 0.05 or p = 0.01, will more nearly fill
their requirements in estimating scoring errors.

If b(=N,x) > 50, Table 1 may not be used. In this case only a 


slight error will result if it is assumed that x is distributed normally 
about mean m, with variance m,. (In the Poisson distribution, the 
mean equals the variance). The usual method of estimating upper fi- 
ducial limits for the normal curve gives 


m= +2/, (18) 


where z is the distance in standard deviation units from the mean to 
the point cutting off a tail containing 100 p% of the area of a normal 
curve. Solving (18) for m, we get 


$$m_s = \bar{x} + z\sqrt{\frac{\bar{x}}{N_s} + \frac{z^2}{4N_s^2}} + \frac{z^2}{2N_s} \tag{19}$$

or 

$$m' = z\sqrt{N_s\bar{x} + \frac{z^2}{4}} + \left(N_s\bar{x} + \frac{z^2}{2}\right). \tag{20}$$
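
The step from (18) to (19) is the solution of a quadratic in m_s. The following sketch fills in the algebra, assuming that (18) has the form m_s = x̄ + z√(m_s/N_s) and that m’ in (20) denotes N_s m_s; both readings are inferred from (19), (20) and Table 2 rather than stated explicitly here.

$$
\begin{aligned}
(m_s - \bar{x})^2 &= \frac{z^2 m_s}{N_s}, \\
m_s^2 - \left(2\bar{x} + \frac{z^2}{N_s}\right) m_s + \bar{x}^2 &= 0, \\
m_s &= \bar{x} + \frac{z^2}{2N_s} + \sqrt{\left(\bar{x} + \frac{z^2}{2N_s}\right)^2 - \bar{x}^2} \\
    &= \bar{x} + \frac{z^2}{2N_s} + z\sqrt{\frac{\bar{x}}{N_s} + \frac{z^2}{4N_s^2}} .
\end{aligned}
$$

The larger root is taken because an upper limit must lie above x̄. Multiplying through by N_s gives the form (20).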


Table 2 gives a comparison of the values of m’ from Table 1 and 
from the normal curve using (20) for b (= N_s x̄) = 50. 











                              TABLE 2 

p:                         0.50   0.10   0.05   0.02   0.01   0.005  0.001 

m’ from Table 1, b = 50    50.7   60.4   63.3   66.7   69.1   71.3   76.0 
m’ from (20), b = 50       50.0   59.9   63.1   66.8   69.4   71.8   77.1 








It is seen that use of (20) for b = 50 leads to under-estimation of 
m’ only when p > 0.02, and is only slightly under for p = 0.05. 
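
The normal-curve row of Table 2 can be recomputed directly. The Python sketch below takes (20) in the form m’ = z√(b + z²/4) + (b + z²/2) with b = N_s x̄, and treats z as the one-tailed normal deviate cutting off an upper tail of area p; the function name is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def normal_upper_limit(b, p):
    """Approximate upper limit m' from (20), where b = N_s * x-bar is the
    number of scoring errors observed in the sample."""
    z = NormalDist().inv_cdf(1.0 - p)   # deviate cutting off an upper tail of area p
    return z * sqrt(b + z * z / 4.0) + (b + z * z / 2.0)


for p in (0.50, 0.10, 0.05, 0.02, 0.01, 0.005, 0.001):
    print(p, round(normal_upper_limit(50, p), 1))
```

For b = 50 the printed values should agree with the second row of Table 2 to the one decimal place shown there.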

The larger the set from which the sample is taken, the smaller the 
proportion of papers needed in the sample in order to satisfy (15). 
For example, if N_t = 10, b = 1, p = 0.05 and T = 0.48, we find from 
Table 1, m’ = 4.7. By (15) this gives 


$$N_s = \frac{(4.7)(10)}{(10)(0.48) + 4.7} = 4.9,$$


meaning that half the total number of papers would have to be 
rescored. 


If all the foregoing figures remain the same except that N_t is 
100, (15) gives 


$$N_s = \frac{(4.7)(100)}{(100)(0.48) + 4.7} = 8.9,$$


and only 9% of the papers need to be rescored. It is clear, therefore, 
that the scoring procedure should be so arranged that the probability 
of scoring errors will be constant for as large a number of papers as 
possible. In general, it will not be valid to combine papers scored by 
more than one scorer, as the probability of errors is likely to vary 
from one worker to another. If fatigue affects scoring accuracy, it 
may not be safe to assume that all papers from one scorer for a single 
long scoring period have the same probability of scoring errors. There 
are thus limits on the size of set which can be considered homogeneous 
with respect to the probability of scoring errors. Homogeneity may 
be tested by finding the binomial from the expressions (2) and (3) 
and seeing if q is small and n large. For a discussion of this problem, 
see Whitaker (9). If q > 0.01, the assumption that the population is 
distributed according to the Poisson law may not be sound. It may 
then be better to assume that the population has a binomial distribu- 
tion. Upper fiducial limits for the binomial may be found from charts 
given by Clopper and Pearson (1, pp. 410 and 411). 
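
The two applications of (15) worked above, for sets of 10 and of 100 papers, can be reproduced with a short computation. The Python sketch below takes (15) in the form used in those worked figures, N_s = m’ N_t / (N_t T + m’); the function and argument names are illustrative only.

```python
def sample_size(m_upper, n_total, tolerance):
    """Sample size N_s from (15), as applied in the worked figures:
    N_s = m' * N_t / (N_t * T + m')."""
    return m_upper * n_total / (n_total * tolerance + m_upper)


for n_total in (10, 100):
    n_sample = sample_size(4.7, n_total, 0.48)
    # Report N_s and the fraction of the set that must be rescored.
    print(n_total, round(n_sample, 1), round(n_sample / n_total, 2))
```

With m’ = 4.7 and T = 0.48 this gives N_s = 4.9 for a set of 10 papers and N_s = 8.9 for a set of 100, that is, about one half and about 9% of the set respectively.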

An examination of the size of m_s may indicate that the methods 
of scoring are not sufficiently accurate, even though by using samples 
of size N, based on (15) we are confident that the tolerance limit is 
being met for the set as a whole. In this case, economy of scoring may 
perhaps be secured by having the original scorers work more slowly, 
if that will increase accuracy, particularly if rescoring costs more per 
paper than the original scoring. This may be the case, for example, 
when the original scoring requires that the papers be marked. 





The writer takes pleasure in expressing his gratitude to Profes- 
sor T. L. Kelley for suggestions regarding some of the problems that 
arose in connection with this paper. 